Let's get started. Today we're going to talk about Mesos for hybrid cloud needs. A little bit about me: I'm the lead architect at DexYP. It was YP before our company got acquired, so it's called DexYP now. I lead the microservices and DevOps initiatives for the company, and I've presented on containers and DevOps at various conferences — at MesosCon for the past two years (this is my third year here), at LinuxCon, and at various other internal company forums.

This is how we're going to divide the talk. Don't worry, I'm not going to bore you with the basic cloud stuff, but I want to lay a foundation before I move on to my main agenda. We'll talk about your cloud journey, your DR strategies in the traditional environment and in the cloud, how hybrid cloud comes into the picture, and how Mesos solves the DR problem — DR being disaster recovery. Then we'll cover the new workloads and their challenges, some conclusions and takeaways, and I'll keep five minutes at the end for Q&A.

OK, so let's look at the cloud journey. People coming from the on-premise world who are looking at the cloud generally fall into one of these categories: they're explorers; or they're running dev and test workloads in the cloud; or they're adding new apps that run in the cloud; or they're doing some sort of cloud bursting — if they need more resources on the fly, they dynamically use on-demand cloud capacity. Some want to move everything, legacy and all, onto the cloud. And people like Netflix are all in. So if you're in the on-premise world but looking at the cloud, you're following one of these paths.

Why choose cloud?
Again, this is fairly obvious — you've heard it from all the vendors, the AWS and Google Cloud folks — but I'll just list it out. The TCO, the total cost of ownership, is much lower. Time to market for your apps is extremely fast: you write your apps, the resources are already there, and boom, you're out there. With aging hardware in your on-premise data center, any migration, upgrade, or maintenance takes a long time, which is not the case in the cloud. You get better resource utilization. You can try before you buy. You can do on-demand cloud bursting — if you need more computation on the fly, you can get it. And you can go global in minutes without doing too much.

One of the other reasons people look at the cloud is for better DR and BCP planning — DR is disaster recovery, BCP is business continuity planning for your set of applications. Doing DR with your on-premise infrastructure is very expensive, so the cloud provides a cheaper alternative. And we're a bit lazy: we want to delegate our DR responsibility to somebody else. If the cloud is doing it, let them do it — I'll just pay them, and it's much easier.

But you might be surprised how many people actually do disaster recovery planning for their applications. It's quite sad, actually. Three out of four companies get a failing grade for DR. They might have a plan, but when they actually execute or test it, that's when they realize the plan is all messed up. And 60% of the companies that have a DR plan don't have it documented.
Everything is in people's heads — if something happens, a few core people know how to bring it back up. 40% said their DR plan simply didn't work when they needed it. And of the companies that lose their data in a disaster, 60% close down within six months.

So let's look deeper into disaster recovery and how we can best tackle it. But first, let's cover what DR is, how it works in traditional and cloud environments, and how Mesos is going to help us with it. Disaster recovery is a set of processes and policies for bringing a system back up after a catastrophic event. If you look at it, it all boils down to how fault tolerant and how highly available your application is. If your application is highly fault tolerant, maybe you don't have to worry as much about DR — actually you still will, and I'll talk more about that — but it depends more on the underlying platform.

DR is about finding a good balance between RTO and RPO. RTO is the recovery time objective: the shortest amount of time in which you can bring your application back up. RPO is the recovery point objective: how much data loss you can tolerate before bringing your application back up. So technically, when a disaster happens, we want to bring our application back up as soon as possible, with minimal RTO and RPO — minimal downtime and minimal data loss. Remember these two metrics, RTO and RPO. This is what we're shooting for.
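To make these two metrics concrete, here's a small sketch — my own illustration, not anything from a DR product — of the pass/fail test you'd apply after a DR drill: the drill only passes if both the measured downtime and the measured data-loss window come in under the targets.

```python
from dataclasses import dataclass


@dataclass
class DrTargets:
    rto_minutes: float  # max acceptable time to restore service
    rpo_minutes: float  # max acceptable window of lost data


def meets_targets(targets: DrTargets,
                  downtime_minutes: float,
                  data_loss_minutes: float) -> bool:
    """A DR drill passes only if BOTH objectives are met."""
    return (downtime_minutes <= targets.rto_minutes
            and data_loss_minutes <= targets.rpo_minutes)


# Example: a target of 30 min RTO / 5 min RPO.
targets = DrTargets(rto_minutes=30, rpo_minutes=5)
print(meets_targets(targets, downtime_minutes=20, data_loss_minutes=2))   # True
print(meets_targets(targets, downtime_minutes=45, data_loss_minutes=2))   # False
```

The point of writing it down like this is that a plan that hits RTO but blows RPO (or vice versa) still fails — which is exactly what those untested plans discover the hard way.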
This is what we want to get as close to as possible. Now let's see how disaster recovery happens in the traditional on-premise environment. Say you have DNS and two data centers, one on the east coast and one on the west coast. You have a switch, you have replicating databases — all of that. You have an exact replica of everything running in the other data center. You might not be using it, but you're paying for the data center, the power consumption, everything, and your apps are staged there, good to go, with your DB being replicated to the other region. This is what has been happening for a long time: two full sets of everything already running. The upside is that failover is extremely rapid — if something happens to region one, region two can just take over. You start the backup, change the DNS, and everything is served from the other data center.

Now, DR has been made easier in the cloud, because all the cloud offerings provide you GTMs and everything is software defined. You go to an interface or an API and configure your load balancer, your database, your queues — everything. Most of the components of the cloud are fault tolerant, and the ones that aren't can be made fault tolerant with the right architecture. There's no upfront cost — you just provision things on the fly, and then you have a DR environment. Backups are extremely low cost, and it's pay as you go: if you're not using it, you don't pay for it.
And the recovery time is fairly small. That's why people are looking at the cloud as a solution for DR. So let's look at how it works.

Cold DR is the cheapest option — there are no upfront costs. In another region, you simply earmark the servers you're going to use if something happens. You're not replicating the data, so it's not for rapid recovery, but you have everything staged, configured, and up to date, and as soon as something happens you can start everything up on the other end. You're not paying for anything until then — you've just earmarked it — which is why it's cheaper, and this is the most common approach.

Warm DR is similar to cold, except that you're doing live replication of your database to the other region. Everything else is staged but not running. As soon as something happens to region one, you start region two, it connects to the slave database that's already there, and it starts serving traffic. It's rapid — better than cold DR — and there's some cost, because you're running a database in region two.

Hot DR is exactly what you'd do in the traditional environment: everything is up and running in both regions, across multiple zones, with live replication. You're paying a lot, but if you want rapid recovery with an RTO close to zero, that's what you pay for.
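To make the trade-off between the three tiers concrete, here's a toy sketch — again my own illustration, with made-up cutoff numbers rather than vendor figures — that picks the cheapest tier that can plausibly meet a given RTO/RPO target.

```python
def cheapest_dr_tier(rto_minutes: float, rpo_minutes: float) -> str:
    """Pick the cheapest DR tier that can plausibly meet the targets.
    The numbers below are illustrative assumptions, not vendor SLAs:
      cold: hours to start everything, data restored from old backups
      warm: tens of minutes to start the app tier, DB already replicated
      hot:  near-zero failover, everything already live in both regions
    """
    tiers = [
        ("cold", 240, 1440),  # ~4 h RTO, backups may be ~a day old
        ("warm", 30, 5),      # app tier boots, DB replication lag ~minutes
        ("hot", 1, 0),        # live-live, essentially no loss
    ]
    for name, rto, rpo in tiers:  # ordered cheapest first
        if rto <= rto_minutes and rpo <= rpo_minutes:
            return name
    return "hot"  # the tightest targets always need live-live


print(cheapest_dr_tier(480, 1440))  # cold
print(cheapest_dr_tier(60, 10))     # warm
print(cheapest_dr_tier(5, 0))       # hot
```

The shape of the function is the real point: DR tier selection is just your RTO/RPO requirement walked down a cost ladder.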
If you have a critical application that's extremely important to your company — financial, for example — you go with this approach.

So that's all the good stuff; I spoke about the benefits. Now let's look at some of the limitations. First, you're technically locked in. Say you're using a relational database in Amazon: you create a volume there, map an EBS block storage device, and run the database on it. Now you're using all the solutions your cloud provides, and if you wanted to move to another cloud vendor, it's not that easy — you'd have to re-architect your app or redo a whole lot of things. And for some components, even though the cloud offerings say they're highly available and highly fault tolerant, you still can't plan a DR around them. Let me give you one example. You've heard of S3 — it's an extremely popular object store; most companies use it. Consider what actually happened: people wanted to do DR but couldn't, because they had relied on S3 being highly available and highly fault tolerant.

We had the big Amazon outage earlier in the year, in February. S3 went down in one region — it was actually a typo in a command from one of the engineers — and that affected the entire US East region. All kinds of things were impacted. If you had your images in Docker Hub registries, you couldn't pull them down. Slack was affected. Imgur was affected.
A bunch of services that had files there started to act up; the files just weren't retrievable. You'd think companies as big as Slack or Docker would have better DR planning, but this is what happened: they relied completely on S3 and couldn't get things back up. And not only that — lots of IoT devices, thermostats, light bulbs, were impacted, and there was no way to bring them back up. On top of that, even Amazon's own health dashboard was showing the wrong status: it kept showing green instead of red, because it couldn't pull the status icons from S3. So Amazon ended up using their Twitter handle to communicate which regions were unavailable and when they would come back.

To overcome this kind of thing, people have been working on hybrid cloud, which is getting more and more popular now. Let's look at some of the hybrid cloud solutions out there in the market. What they do is abstract the cloud offering — it's like a vending machine. Say you want to create an environment for Ruby on Rails: you just select the image, and it provisions the same image across multiple cloud vendors — Amazon, Google, whatever. Very easy to use. They abstract both the infrastructure and the software.
If you're using any software service from any of the cloud platforms — an EBS-style block store, a load balancer — they've abstracted that too, so it works across all those cloud offerings as well as your on-premise infrastructure. So it lets you run your workloads in a multi-cloud environment: the same workload can move around and run on-premise, in a private cloud, in a public cloud, whatever you want, because your apps now talk to a unified common interface. Some of the most popular solutions out there are RightScale, Scalr, and VMware's offerings — a whole lot of vendors provide this sort of thing. Like I said, it's a common API and UI to manage all of your workloads, to provision and orchestrate and all of that. And one of the best things they do is cost evaluation: they tell you how much you're using in your cloud, how much money you could save, and where you're unnecessarily running instances you don't need, so you can take action.

The other thing is that if you want to run the same workload on-premise, you have to install the solution's cloud controllers in your own data center, and your apps have to be re-architected or redesigned to talk to that cloud controller. So that's a whole other effort. We wanted to get away from being locked in to a cloud offering, but this is exactly what we've gotten into: now we're locked in to the hybrid cloud vendor. For example, if I'm using the RightScale API to abstract everything, now I'm technically locked in to RightScale.
All my applications are talking to the RightScale API, and if I want to take it out and replace it with something else, that's a whole big challenge. This is exactly the problem these solutions were trying to solve, but here we are again. And as I said, you have to change all your apps to talk to the cloud controller you install on-premise, so that they can use the unified API or interface along with the rest. That's one of the limitations. Also, any hybrid cloud vendor has to keep up with all the upstream updates: if Amazon or Google change the APIs for their block storage or load balancers or whatnot, the vendor has to keep up. And these platforms are not sophisticated enough to run all the new kinds of workloads on a unified platform — you'll have to do a lot of that work on your own. They're good if you want to abstract the underlying services and infrastructure of the cloud providers, but for running the new kinds of workloads — I'll tell you later what those are — you're on your own.

So then what is the solution? People are selling me all kinds of stuff — cloud offerings, hybrid cloud solutions. What do I do? How do I solve this? When I'm using any hybrid cloud solution for DR, these are some of the things I want. I don't want any vendor-specific solution. If I want to spin up a container, for example, I don't want to rely on the Amazon EC2 Container Service. Everything should be generic and abstract; I don't want to rely on any underlying cloud offering that each vendor provides. And software as a service is a strict no-no.
I don't want to rely on or use any of the vendors' cloud service offerings at all — that's a strict no-no. But if you're convincing me to use the cloud as infrastructure, then yes, I want that, because it's cheap and I can spin things up very fast. So IaaS is a yes for me; SaaS from the cloud is a strict no. And I want full portability. I want to rely only on open-source tools for everything they're offering — container services, block storage, object storage, volumes, all of it. You can get all of that from open-source tools; you don't have to rely on a commercial solution. And I want complete independence: if I want to move from one cloud to another, or from one hybrid cloud vendor to another, I can very well do that. Those are my requirements.

And I don't want any top-to-bottom approach. What I mean by top-to-bottom is this: say I'm doing DR planning for one of my apps. I would identify the app, figure out how many load balancers it requires, what the app servers are, what the DB servers are, and then make an exact copy of all that — a static provisioning of resources for my app. When you look at DR that way, you're looking at it from the top down. Even if I'm not provisioning those resources yet, I'm earmarking them for future use — as in the case of cold DR. I want to get away from that. I don't want to plan anything statically; I want to go from the bottom up. I'll show you how that's possible.
So this is how we'll do DR with Mesos. But first, let me cover a little terminology. On my left side is a region — East Coast, West Coast, any region. Within a region there's a concept called zones. Zones are data centers: data centers within a region, each with its own separate power, connected to each other with high-bandwidth links — fiber optic, maybe — so the latency between zones is extremely low. Think of zones as data centers and the region as the entire East Coast or West Coast. Regions are connected to each other over the public internet.

So let's see how we do it. Remember, I want to tackle this from the bottom up, not from the top down. Here's how I would do DR with Mesos: I can have all the Mesos masters running in one zone, with my agents spread out across multiple zones. To make it even more highly available, I'd keep the masters on different racks, so that if something happened to one rack, I'd still have a quorum — with three masters, a quorum of two — and maintain high availability.

Now let's see what happens in the case of failure. Say your zone two goes down. Mesos will reschedule all the workload that was running in zone two onto zone one — provided you have enough capacity in zone one. So what have you done? Rather than doing DR planning from the top down, you're letting the platform handle it.
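For example, if you're scheduling with Marathon on top of Mesos — a common setup, though the details here are illustrative — you can tag each agent with a `zone` attribute at startup (e.g. `--attributes=zone:us-east-1a`) and use a placement constraint to spread instances evenly across zones. The app id, image, and attribute name below are made-up placeholders:

```json
{
  "id": "/web-app",
  "instances": 6,
  "cpus": 0.5,
  "mem": 512,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "mycompany/web-app:1.0" }
  },
  "constraints": [["zone", "GROUP_BY", "3"]]
}
```

With `GROUP_BY` across three zones, Marathon keeps roughly two instances per zone, so losing a zone takes out only a third of the capacity and the survivors get rescheduled elsewhere.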
You're not planning anything like "if this goes down, I've earmarked these other servers to run my workload" — no. You're looking at it from the bottom up, from the Mesos world. If something happens, your workload moves; zone one is simply where your workload runs now.

Now, what happens when you lose the zone that's running your masters? In that case your masters are down, but the apps running in zone two will still be running. Your apps will not go down. They won't get new workload scheduled, but they'll continue to function — they'll have the right load balancer in front of them, they'll keep serving traffic, and you won't be impacted. As soon as your masters come back up, everything syncs up again and new workload starts getting scheduled.

You can do even better than that with an approach called multi-zone masters: spread your masters across multiple zones — zone one has master one, zone two has master two, and so on. At any given point in time, one of them is the leader, handling all the workload distribution and scheduling. If you lose one zone in this setup, a master in another zone becomes the leader and picks up the slack. Now, this hasn't been tested thoroughly — it's still a work in progress — but I think this is the future. If you're talking about high availability, fault tolerance, and disaster recovery, this is how you should look at it, rather than the static provisioning that Mesos has gotten us away from.
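The reason master placement matters comes down to quorum arithmetic: the surviving masters must still form a strict majority for leader election to succeed. Here's a small sketch (my own illustration) that checks whether a given masters-to-zones placement survives the loss of any single zone.

```python
from collections import Counter


def survives_any_zone_loss(master_zones):
    """master_zones: list of zone names, one entry per Mesos master.
    Returns True if losing any single zone still leaves a strict
    majority (the ZooKeeper/Mesos quorum) of masters alive."""
    total = len(master_zones)
    quorum = total // 2 + 1          # strict majority
    per_zone = Counter(master_zones)
    # After losing zone z, the survivors must still reach quorum.
    return all(total - count >= quorum for count in per_zone.values())


# All three masters in one zone: losing that zone loses everything.
print(survives_any_zone_loss(["z1", "z1", "z1"]))                # False
# Three masters across three zones: any single loss leaves 2 of 3.
print(survives_any_zone_loss(["z1", "z2", "z3"]))                # True
# Five masters across three zones (2+2+1): worst loss leaves 3 of 5.
print(survives_any_zone_loss(["z1", "z1", "z2", "z2", "z3"]))    # True
```

This is also why the single-zone-masters layout described earlier degrades gracefully rather than failing completely: losing the master zone stops new scheduling, but it doesn't touch running tasks.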
Especially in terms of disaster recovery, this is how you should think about it. If you're designing a data center, try to work with zones: identify the zones in a particular region, spread your masters across them, and then run your workloads. That's how you should architect it.

Now, some of the benefits. What you've technically done is build the fault tolerance and high availability into the infrastructure. You're not relying on any of those proprietary cloud offerings; it's your own baby, you're handling it on your own — you're letting your infrastructure do it. Don't get me wrong, you still have to do DR, but now you do DR at the platform level, not at the individual application level. You have to think about how you'd do DR with zones and regions and that kind of thing, rather than worrying about each individual application. And as I said, there's no static provisioning of resources, you're not locked in to any vendor, and you're using open-source tools for everything — you're getting everything you wanted. And you can still use the cloud for backup: run your primary on-premise and use object storage in the cloud for your backups, or a database there for your snapshots, or anything like that.

All the Mesos components and the ecosystem around them are built from open-source tools: Docker for containers, Kafka if you want message queues, Jenkins if you want build pipelines. My point is that you're getting all of this automatically, for free, in the Mesos ecosystem.
So why get tied in or locked in with any cloud provider? If you already have on-premise infrastructure, get started with this: use these open-source tools, and use the cloud for your DR, backups, and content replication. Then you can move things around freely.

If you do this, you can run all kinds of workloads on this tool chain or stack. You can run your traditional workloads — Java apps, web apps, app servers, monolithic apps, batch processing jobs — and you can run containers, microservices, APIs, message queues, MapReduce jobs. And to a large extent, we're running most of this in our Mesos world at YP. We're running all of it except serverless, which we're looking into once OpenWhisk gets more traction — we do have a use case for serverless architecture as well. We've been running all of this, with better availability, for almost two years now — with extremely high availability; we haven't had any downtime.

And once you get into this world, there are other challenges worth working on — better problems than getting your apps to work with commercial cloud solutions. Things like: how do I do automated service discovery in this ephemeral environment? How do I inject secrets into containers? How do I do config management for my containers at runtime? How do I do centralized logging and centralized metrics? How do I do proper isolation with respect to CPU and memory?
And how do I do metering on top of all of that? Fortunately, the ecosystem has evolved a lot in the last few years, so all of this is available right now. Your secret sauce is getting all of it to work together, so that you can run heterogeneous workloads in your environment.

So here's the conclusion, the takeaway. No matter which cloud provider comes to you for a sales meeting, remember that DR and BCP planning is your responsibility. If something happens, they don't lose the customers, they don't lose the money — you do. It's your game, your responsibility, no matter what anybody says — no matter what even I say. If your app is down for an hour or two, you're losing customers, so you should come up with an architecture that can cope with that. And as you saw in the Amazon S3 example, you can't simply rely on "they already have DR, it's highly available, so I don't have to care about it." No — you have to care about it.

You should be able to pick up your solution and run it anywhere: on-premise, cloud, private, wherever. If you're using this sort of tool chain or stack, you can run it wherever you go — you're not tied to anybody. And consider everything as infrastructure. Anything above that — PaaS, SaaS — I don't like any of that. Use the cloud, sure, but for me, only as infrastructure, not with SaaS mixed in. So the solution is hybrid.
So, in a nutshell: with the help of Mesos, you can run your stuff with high availability, and you get the flexibility of scaling with the cloud on demand — without getting locked in. Here is some of the material I used for my talk; you're welcome to go look at all of it. And I think that's all. Thank you very much, thanks for listening. I think we have a few minutes left for Q&A.

Audience: Are there any caveats to running Mesos across multiple data centers in the same cluster? Or is it just that easy — install it and go?

So the thing is, you have to architect it so that you're spreading your masters across multiple zones, but it's not going to work between multiple regions. That's a strict no-no, because regions talk over the public internet — how would ZooKeeper do leader election with that much traffic going back and forth? But the zones within a region are connected with high-bandwidth fiber optic cables, so that works much better. Again, this hasn't been tested at that scale, but I think the DC/OS folks are working on making it more robust, so we'll see more of that coming.

Audience: To add a data point — we run Mesos masters across all three zones, with a ZooKeeper-elected leader, and we run five masters. It works.

Perfect. I would agree. And once the gossip protocols evolve properly, I think down the line it might be possible to have masters spread across multiple regions. That's what I'd love to see in the next few years, so that I can have complete DR availability on my end. Any more questions? No?
All right. Thank you very much. Thank you guys.