Good morning. For the first talk this morning we have David from {code}; I'll let him introduce himself. Just a quick reminder: if you have questions at the end of the session, please wait for us to give you the mic, because the session is recorded, and it's good to have the questions recorded too. Thank you. Go ahead.

All right. Can you guys hear me? Perfect. Cool. Awesome. So: robust applications in Mesos using external storage. My name is David vonThenen. I work for the {code} team, which is a division of Dell Technologies; it's basically the open source initiative within Dell. Dell is a very large company and does a lot of open source, and what our group tends to focus on is container orchestrators and storage: Kubernetes, Docker Swarm, and Mesos and DC/OS, which is why I'm here speaking today. Before working with Mesos and DC/OS, in a past life, I was in the virtualization space, working in backup and recovery, specifically on VMware backup and recovery solutions.

Here's the agenda for today. We're going to talk about Mesos storage options and what's available: past, present, and even the future. Then we'll take a look at traditional databases and how the different storage options affect the initial deployment of those applications, and also how they affect your day two operations, things like disaster recovery. Then we'll look at NoSQL and key-value stores and do effectively the same thing: walk through how different storage deployment strategies affect initial deployment and the day two stuff. And then a brief wrap-up, where I'll highlight the important differences between the storage options and how your deployment strategy affects the life cycle of your application.

So, jumping right into it: Mesos storage options. I'm going to structure this session a little differently. A lot of presentations define the problem, build up to the difficulties that users and administrators have, and then give you the solution at the end. I'm going to do it in reverse: I'll give you the storage solutions that are available today first. The reason is that we're then going to look at how the storage decisions we make affect the life cycle of your application, whether that's maintenance, disaster recovery, or all the other day two stuff for your app.

If we look at the container ecosystem today, there's a good mix of applications that are transient and some that are long running. Whether these applications are transient or long running, they generate data. Sometimes that data is useful and you want to persist it; other times you can just throw it away and you don't really care. And that data takes many shapes and forms. There's user data: if you're running a database, you're obviously collecting data for your Postgres instance. And then there are the simplest things, like configuration.
Maybe you have an application running in multiple places, but the configuration varies from instance to instance. And if you take a look at containers today, I have a little snapshot right here of Docker Hub: of the top 20 apps in the Docker Store, 10 are persistent applications, applications where you actually want to save the state or the data associated with the app. Just as some examples, you see Redis, MySQL, Mongo, Postgres, Docker Registry itself. So there are a lot of applications out there that require persistence, and it's becoming more and more prevalent in the container space today.

You might ask: why is it important that we can deploy stateful applications inside containers? The reason is that containers have a lot of advantages for stateful applications. Containers provide consistency: you can deploy an application with the same environment everywhere, no matter what compute that particular container lands on. There's also dependency management. If you're going to define a container, and it happens to be a Docker container, the Dockerfile outlines all the dependencies that application needs. You're effectively defining all the packaging associated with your application, and all of that gets rolled up into the container image. And because you have all the dependencies wrapped up in the container, and that consistent environment everywhere, running everything inside your container orchestrator, whether stateful or stateless, means you're avoiding snowflakes. If all your stateless applications run in Mesos or DC/OS but your stateful database or NoSQL store runs off to the side, you're treating that set of hardware and that application as unique; you're babying it, maintaining it in a different way from your container environment. By running everything together, you're treating all your applications, stateless or stateful, in the same fashion.

So those are the container attributes and how stateful applications benefit from them. Now look at what container orchestrators provide. They provide health monitoring: if your application is, say, a simple REST service, the orchestrator can run health checks, which can be as simple as an occasional ping of the REST API to make sure the API is still up and running. They also handle rollouts: if you want to roll out a new version of your application, or if a rollout didn't go so well and you need to roll back, they provide that functionality too.
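As a concrete illustration of that health-check idea, here's a minimal sketch of a Marathon app definition; the app id, command, and /ping endpoint are hypothetical, but the healthChecks structure is the shape Marathon uses.

```json
{
  "id": "/my-rest-service",
  "cmd": "./run-service --port $PORT0",
  "cpus": 0.5,
  "mem": 256,
  "instances": 2,
  "healthChecks": [
    {
      "protocol": "HTTP",
      "path": "/ping",
      "portIndex": 0,
      "gracePeriodSeconds": 30,
      "intervalSeconds": 10,
      "maxConsecutiveFailures": 3
    }
  ]
}
```

If the service misses three consecutive checks, Marathon kills the task and reschedules it, which is exactly the kind of babysitting you would otherwise do by hand.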
And then there's also declarative configuration. If you want to deploy a Postgres database with external storage attached to it, it's declarative in the sense that you just say you want to deploy it; all the underlying details about the environment and the libraries associated with that application, you really don't care about. You trust that the container itself has all those dependencies baked into it. So when you want to deploy that database, or that Cassandra cluster, you just say "I want to deploy Cassandra" and the details are already taken care of inside the container. That leads to the next point: look at DC/OS. They have a curated app store, effectively, with a nice push-button interface. You go in, click the application you want to deploy, whether it's Postgres or Cassandra or Elasticsearch, and it's really easy to roll those applications out. It's all of these things combined that give you that kind of experience in a container orchestrator like DC/OS.

So we've talked about containers and container orchestrators; now we have to look at the fundamental problem with containers: they're ephemeral in nature. A container comes up, data gets collected, and when the container comes down, the container and the data are gone. But anybody who's run anything in a production environment knows that if you're going to run a stateful application, that state needs to be available at all times. And not only that: when you have a failure on a particular node and the container gets moved around within your cluster, you want that data to follow the container.

That leads to the two options available for data persistence. The first option is local attached disk, basically direct attached storage: the hard disk connected directly to your compute node. The other option is external storage: storage that lives outside the compute node but can be attached to the host and mounted as a volume.

I want to briefly touch on the history, and then we'll look at what's available now and what's potentially coming. The ability to carve out X gigabytes of local disk on a given compute node was introduced in Mesos 0.23. If you wanted to persist data for your application, you could say "I want 40 gigabytes of this 120-gigabyte disk" and that space would be carved out for it. Then in September of 2015, the {code} team introduced mesos-module-dvdi, a third-party component that you would install and manage alongside Mesos. It's a file system isolator implementation that calls into existing Docker volume driver interfaces, which effectively gave Mesos the ability to provision external storage for anything running on the Mesos (universal) containerizer. And anything running as a Docker workload, on the Docker containerizer, would just plug into the Docker volume driver interface directly.
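For flavor, this is roughly what consuming mesos-module-dvdi looked like from Marathon, going from memory of the project README: you asked for a volume through environment variables on the app, and the isolator created and mounted it. The variable names, option string, and mount path shown here may differ slightly between versions, so treat this as a sketch.

```json
{
  "id": "/hello-dvdi",
  "cmd": "while true; do date >> /var/lib/rexray/volumes/testvolume/date.txt; sleep 5; done",
  "cpus": 0.1,
  "mem": 32,
  "env": {
    "DVDI_VOLUME_NAME": "testvolume",
    "DVDI_VOLUME_DRIVER": "rexray",
    "DVDI_VOLUME_OPTS": "size=5,newfstype=xfs,overwritefs=false"
  }
}
```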
Then in Mesos 1.0, that mesos-module-dvdi file system isolator was contributed back upstream into Mesos and is now part of Mesos natively. So external volume support is natively available as of 1.0 and higher.

We've already talked about mesos-module-dvdi, the isolator that calls into Docker volume driver implementations. One of the things our team, the {code} team, builds is a Docker volume driver implementation called REX-Ray. It's an agnostic storage orchestration engine that abstracts back-end storage platforms and allows you to provision storage from those platforms. We support things like AWS and GCE, basically all the public cloud providers, and if you're looking to do things on-prem, we support Ceph, Cinder, and a Dell EMC product called ScaleIO, which is a scale-out, completely software-based storage solution. So REX-Ray provisions volumes natively for Docker workloads, and mesos-module-dvdi provisions storage for the Mesos containerizer by effectively calling into REX-Ray. And like I said, both of these are open source. The second one isn't really maintained as much anymore, because it now lives in Mesos, so the maintenance happens there. The GitHub links are right there; you can take a look at the repos and all the fun stuff that goes along with them.

We've talked Mesos, so let me talk a little bit about DC/OS and the storage options there. Who here is running DC/OS or has used it? Okay, cool. And of those people, how many have deployed an application using external storage? A few? Okay, cool. So the nice thing about DC/OS is, as I've said, the curated repository makes it really easy to provision applications. And not only that: if an application requires some form of external storage, there's literally a simple little checkbox, plus how much storage you want to allocate and where you want to mount it inside the container. It's very easy to provision external storage for your app. And the fun part is, if you've used that functionality before, what's actually running under the covers? On every DC/OS node out there, REX-Ray is actually installed. So without even realizing it, those of you who have provisioned external storage on DC/OS are already REX-Ray users.
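To make that concrete, here's a minimal sketch of what a REX-Ray setup might look like on a node, assuming a REX-Ray release with the EBS driver; the exact config keys have shifted across REX-Ray versions, and the credentials are obviously placeholders.

```yaml
# /etc/rexray/config.yml -- minimal sketch for an AWS EBS back end
libstorage:
  service: ebs
ebs:
  accessKey: MY_ACCESS_KEY    # placeholder
  secretKey: MY_SECRET_KEY    # placeholder
```

With the REX-Ray service running, Docker workloads can then provision volumes straight through the Docker volume driver interface:

```
$ docker volume create --driver rexray --name postgres-data --opt size=40
$ docker run -d --volume-driver rexray -v postgres-data:/var/lib/postgresql/data postgres
```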
So we've talked about Mesos, we've talked about DC/OS storage options; now, looking to the future. There's an initiative happening called the Container Storage Interface, or CSI, and it's modeled after OCI and CNI: the Open Container Initiative and the Container Network Interface. One of the goals of CSI is to standardize storage plugins across container orchestrators. It's a spec right now, and when that spec finally hits the 0.1 release, which hopefully is sometime this month if it hasn't happened this week while we're here, the idea is that if a storage plugin implements the CSI interface, orchestrators like Mesos and Kubernetes can implement the client portion, call into those plugins, and provision storage for your containers. It's standardized, so you can basically swap implementations in and out. I've only hinted at the topic, but if you want a really good introduction, and I think even a deep dive, into what the Container Storage Interface is, there's a session later today at 4:30, next door in Congress Hall 2, co-presented by one of my colleagues here at the front: "Container Storage Interface: what's this project about and where are we going?" I highly recommend you check it out.

Now that we've covered the storage options available in Mesos and DC/OS, I want to look at the different deployment strategies for traditional databases: what the initial deployment looks like, and then the day two stuff like disaster recovery and maintenance. Traditional databases tend to be simple and straightforward, at least the majority of deployments of Postgres, MariaDB, and MySQL. And because they're simple and straightforward, they tend to be monolithic: a single instance with a single collection of databases, all tied to one instance living inside a container. There are other, more complex deployments: you can take a MySQL database and shard it, or even cluster it. I'm only going to focus on the simple deployment, just to get the point across, but if anybody is interested in sharding and clustering and the more complex cases, I'm definitely going to hang around afterwards and we can talk more.

An initial deployment of, say, MariaDB on local disk is simple and straightforward, whether it's in Mesos or DC/OS; in DC/OS it's even easier, you just click in the App Store and off it goes, and you have a MariaDB instance. The interesting thing is that when you deploy on local disk, performance is obviously based on the compute node's storage capabilities. If your cluster is homogeneous and every node has only slow rotating disk, the performance of your database is going to be directly proportional to that slow rotating disk, because they're all the same. Now, if you have a heterogeneous environment where your compute nodes differ, maybe one section of compute runs only slow rotating disk and another runs SSDs, then when you do your initial deploy you have to be aware of what performance characteristic you want for your application. If you don't really care about performance, maybe the slow rotating disk is fine; but if you do care and you want to land that container on SSD, you have to do a targeted deploy of your container. You'd say: I want the gold tier of storage, versus the silver or the bronze tier. What that means is that you now have to know where those resources exist within your Mesos cluster, and there are ways to handle that. You can tag nodes, saying this one provides gold tier, this one silver, this one bronze, and then when you do your deployment, you deploy based on that tag: the tier, the class of storage you want.
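Here's a minimal sketch of that tiered-placement idea using Mesos agent attributes and a Marathon constraint. The attribute name storage_tier is something I'm making up for illustration; any attribute key works the same way.

```
# On the agents backed by SSDs (hypothetical attribute name):
mesos-agent --attributes="storage_tier:gold" ...
```

```json
{
  "id": "/mariadb-gold",
  "constraints": [["storage_tier", "LIKE", "gold"]]
}
```

Marathon will then only accept resource offers from agents advertising the gold tier.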
So what does deploying a traditional database like MySQL look like on external storage? Obviously, you need an external storage platform, and that platform may require some specialized setup or configuration. As an example, if you happen to be running in AWS, your storage platform is most likely going to be EBS, and in that case setup is trivial because it's given to you: you've probably already been playing around with EC2 instances, you're already familiar with EBS and how it operationally works, so the knowledge required to run that storage platform is minimal, almost trivial. On the flip side, if your external storage platform is, say, a Cinder implementation, you have to know Cinder: how to configure it, and how that Cinder implementation interacts with the back-end storage. You also have to think about the day two stuff for the storage platform itself, meaning its maintenance. In the EBS case, I guess you just trust that Amazon is going to do the right thing and take care of your storage. In the on-prem case, if you're running Ceph with a storage platform backing it, you have to be aware of what maintenance operations it needs, whether it's a storage array or a completely software-based solution, to keep that storage healthy at any given time. All of that is managed outside of Mesos; it's something you have to take care of per storage platform.

And just like local attached disk, whether rotating disk or SSDs, performance is based on the platform. The same is true for external storage. If you're in AWS using EBS, I believe there's traditionally something like a one-gigabit throughput limit on EBS volumes, and you're basically stuck with that performance characteristic. The other thing is that if you're going to use an external storage platform, at a minimum you have to be sure that a subset of your compute nodes has access to that platform. That's the minimal case. If you want your container to be able to float freely to any compute node in the cluster, you have to guarantee that the storage is accessible everywhere. Otherwise you incur the potential dance of: I'm going to try to provision storage; I don't have access; I'm going to bounce somewhere else, to another piece of compute.
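For reference, this is roughly what asking for an external volume looks like in a Marathon app definition on DC/OS, with REX-Ray doing the work underneath via the dvdi provider. Treat it as a sketch: the volume name is illustrative and the exact fields vary a bit by DC/OS version.

```json
{
  "id": "/mariadb",
  "cpus": 1,
  "mem": 2048,
  "container": {
    "type": "MESOS",
    "volumes": [
      {
        "containerPath": "mariadb-data",
        "mode": "RW",
        "external": {
          "name": "mariadb-data",
          "provider": "dvdi",
          "options": { "dvdi/driver": "rexray" }
        }
      }
    ]
  }
}
```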
So that's the deploy. A lot of presentations and sessions stop there, but I want to look at the day two stuff, because I think the day two stuff is more interesting. It's more interesting because things can go wrong: you have hardware failures, you have maintenance you need to do. I call this the "oh shit" moment: when you realize you have a problem and you need to fix it. And the type of storage you pick heavily influences how difficult your day two operations are going to be.

So, day two operations using local disk. If you're using local disks, direct attached storage on your compute, the biggest problem is data locality: all the data for your container, for your application, exists on direct attached storage. You're susceptible to things like disk failure and host failure; those are the obvious ones. But what happens if you have to perform maintenance on that particular host? Your data isn't completely gone, but if you have to take that host down to add more memory, or to upgrade the NIC from single port to dual port, your application is down for however long that maintenance takes. You just need to be aware that local disk ties you to that node. Any time the application or the node goes down, you are essentially fixed to that node until it comes back up. In the case of hardware failure, you need standby hardware or whatever it takes to bring that node back up as quickly as you can in order to preserve that data. And again, right now we're focusing on traditional databases that are monolithic; it's going to be a little different for NoSQL, and we'll touch on that later.

The other thing you have to consider is that a host has only a fixed amount of disk space: a limited number of disks in the box. When you deploy your application, whether it's MariaDB or Postgres, you have to provision all of that storage up front. The more storage you take, the more you risk a situation where other containers on that node that need some amount of storage can't get it, because you reserved all that capacity up front. And say you do deploy Postgres and tell it: I want all of the storage available on this host, just give it to me. Then, unless you don't care about persistent applications on that node, you're only going to run transient, ephemeral containers there, and the only stateful app is going to be your Postgres instance. And even having taken all of that storage, what happens when you need more capacity beyond what's there? There are ways around it; you could hot-swap more disk in, and so on. But now it's a manual process. A user has to get involved, a system administrator or whoever, and it's not trivial, it's not easy; it may mean maintenance and downtime for your application. And once a person has to get involved, you're treating this particular node like a snowflake: a special case, because my Postgres instance is there, my data sits on these hard disks, and I have to treat this service as special just because we're running out of disk space.
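That up-front reservation is visible right in the app definition. Here's a minimal sketch of a Marathon app using a local persistent volume, in the style Marathon introduced for stateful apps; the 40 GiB figure (size is in MiB) is just an example, and the size is fixed when the volume is created.

```json
{
  "id": "/postgres-local",
  "cpus": 1,
  "mem": 2048,
  "container": {
    "type": "MESOS",
    "volumes": [
      {
        "containerPath": "pgdata",
        "mode": "RW",
        "persistent": { "size": 40960 }
      }
    ]
  },
  "residency": { "taskLostBehavior": "WAIT_FOREVER" }
}
```

The residency section pins the task to the agent holding the volume, which is exactly the locality trade-off being described here: the data can't follow the task anywhere else.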
Now, day two operations with external storage for traditional databases. The nice thing about external storage is that the volume will move with the container. You do have to have a storage orchestrator, something like REX-Ray, that provides that functionality. But if you ever have a hardware failure, whether the memory goes bad, or the motherboard on the server, or the host itself has something like a power supply issue, and that container needs to get rescheduled to another node, a storage orchestration engine like REX-Ray can forcibly detach that volume and make it travel along with the container to whatever new piece of compute it lands on. The same is true for maintenance. If you take that host down and it's no longer responsive, then in the Mesos and DC/OS world, the Mesos master realizes that the agent is offline and reschedules the container somewhere else, and the volume travels along with it. When that container comes back up, the storage gets reattached to your Postgres instance and all your data is available. What I'm describing here is high availability for containers: your application can die, come up somewhere else, and still have all the storage and data available to it.

The other nice thing is that, because you're using external storage, and depending on the storage platform itself, if the platform supports it you have the ability to grow beyond your current capacity. If you're using a traditional big-box storage array, you can add shelves with more disks, expand the LUN, and do all that fun stuff. That's the old-school case, but there are other platforms out there that are more suitable. I already alluded to one: ScaleIO, a completely software-based storage solution. You just install software on your compute nodes, and if you want to expand capacity, you add more disks to a node in your ScaleIO cluster; that automatically expands the capacity of your storage pool, and you can expand the storage for your given container.
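One detail worth calling out: the forced-detach behavior mentioned above is typically something you have to opt into. In REX-Ray this is the "preempt" option; my recollection of the config key path is below, but it has moved between REX-Ray versions, so check the docs for the release you're running.

```yaml
# /etc/rexray/config.yml -- enable volume preemption
# (sketch; the exact key path varies by REX-Ray version)
libstorage:
  integration:
    volume:
      operations:
        mount:
          preempt: true   # let a new host forcibly detach a volume from a dead host
```

Without something like this, a volume can stay attached to a failed host and block the rescheduled container from mounting it.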
So we've talked about the simple use case: traditional databases that are monolithic and standalone. Now we're going to look at NoSQL and key-value stores and see how your storage deployment strategy, direct attached versus external, affects initial deployment and day two operations there.

It turns out that for NoSQL and key-value stores, applications like Elasticsearch, Redis, MongoDB, or Cassandra, the initial deployment is actually quite similar to a traditional database. If you're using local disk, and it happens to be Cassandra, you deploy three nodes, say "use local attached disk, use this much space," and your three Cassandra instances get deployed into your Mesos cluster and you're off and running. And it looks exactly the same for external storage: you create three instances of Cassandra, you have three external volumes, one associated with each instance, you launch your Cassandra cluster, something like REX-Ray makes sure the storage gets attached to each container, and you're off and running. And because they're the same, the same constraints apply. With local disk, just like in the traditional database case, you're still constrained by the performance and the amount of disk space available on the node. With external storage, you're still constrained by having to have an external storage platform, knowing its details and maintenance operations, and you still have to worry about access from the various compute nodes in your Mesos cluster. There are some little differences, but generally it's the same.

The interesting stuff is what happens on day two: maintenance, hardware failure, some unforeseen event you have to take care of. It gets interesting because of the behavioral characteristics of these NoSQL databases and key-value stores, and specifically their eventually consistent behavior. They're distributed, they're multi-node; within a Cassandra cluster they're maintaining data availability, which means copies are being sent to the various nodes out there. That's what sets these NoSQL databases and key-value stores apart from traditional databases.

Because those types of applications are somewhat more difficult to manage, Mesos and DC/OS, in the form of their app store, have a really awesome concept that's special to them: frameworks. It's the two-layer scheduling mechanism they provide, which effectively enables specialization for your particular application. Instead of launching a generic application, a framework can tailor operational and behavioral characteristics to your application.
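Installing one of these frameworks from the DC/OS catalog is a one-liner with the dcos CLI. A sketch below; the exact options.json keys are package-specific and version-specific, so the node count shown is purely illustrative.

```
$ cat options.json
{ "nodes": { "count": 5 } }

$ dcos package install cassandra --options=options.json
```

The framework's scheduler then takes over: it picks agents, deploys the instances, and keeps supervising them, which is the two-layer scheduling just described.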
As an example, I have pictures here of the Elasticsearch framework; this is what its UI looks like. The minimum for these eventually consistent databases is three nodes, so when you deploy this framework, it figures out which compute nodes best suit this particular instance of Elasticsearch and automatically deploys three Elasticsearch instances to those nodes. And if you look at the second screen, a little behind it: if you want to scale Elasticsearch out beyond the three instances, the framework supports saying "instead of three, I want five," so you can tolerate failure better. These frameworks go well beyond initial deployment. A lot of them can scale in, scale out, and do monitoring, monitoring in the sense of: maybe the container itself is healthy, it's up and running, but the application inside it, Elasticsearch, has gone sideways and isn't behaving correctly. The framework can recognize that, because it's application-specific monitoring, and it can bounce that node and move the Elasticsearch instance to a new one, effectively resurrecting it from scratch. And because it has the functionality to move an Elasticsearch instance from one node to the next, it can do automated recovery: bootstrapping and rebuilding that node to rebalance the data across your Elasticsearch cluster. Frameworks are great, they're wonderful, I love them, I've implemented one myself.

Now, the interesting thing, the elephant in the room: all frameworks are great at deploying things. If a framework can't deploy your application, it doesn't really work. So all frameworks are good at deploying applications, some are also good at monitoring, and only a small fraction of those frameworks can actually handle disaster recovery: when a node goes down, making sure they can recover from that situation. So the pyramid narrows as we go: initial deploy, then good monitoring, then an even smaller subset doing disaster recovery. And the problem with disaster recovery is that you really want it to be efficient: if a node fails, you want to handle that failure in the most efficient way possible.

I'm a Star Trek geek, sorry; this is one of my favorite scenes from Star Trek Generations. We're going to look at what happens with a NoSQL database or a key-value store when we have our "oh shit" moment (that is actually what he's saying in the movie): what happens in day two operations when I have a hardware failure on my node, or when I have to do maintenance on my application? How do we handle those situations?

So let's look at what NoSQL databases and key-value stores look like when deployed on local disk, and I'm going to use Cassandra as a concrete example. If you need to fail over one of your three Cassandra nodes on local disk to another piece of compute, you need to do a rebuild on Cassandra to basically restripe your data across the cluster, and obviously, the less data you have, the faster that's going to be.
But like any other application, if your application consumes and throws data into Cassandra, the longer it's been running, the more data you accumulate. And the more data you have, if you have a Cassandra cluster that's very dense in data, moving a Cassandra instance from one node to the next means that rebuild process can take a lot of time. So much time, in fact, that it can take hours, and depending on how dense your Cassandra instance is, it can actually even take days. I'm not going to go into all the operational details and do a deep dive on Cassandra; I'm just giving a generalized overview of NoSQL. But if you're interested in more information, and I'm going to butcher his name, Alexander Dejanovski presented at Cassandra Summit in 2016, and the session was entitled, obviously, "How to Bootstrap and Rebuild Cassandra." I threw the YouTube link in here; it's an hour-long presentation and it's actually fantastic. It gives you a great look at how these NoSQL databases operationally work, how you do disaster recovery, how you do bootstrapping and rebuilding. He even jokingly references that it could take as long as 15 days, except he had an actual example where that did happen. So it can get quite frightening if your application is crippled for 15 days. And if you want the link, you don't have to write down the little thing at the bottom; just download the PDF from the MesosCon agenda page, and the link is in the slides.

While you're doing this bootstrap-and-rebuild process, latency in Cassandra is increasing, and the reason is that the repair process is expensive. It's expensive in that rebalancing all the data within your Cassandra cluster means moving data from one node to the next, so you're eating a lot of CPU and a lot of IO, and that's because you're rebuilding the Merkle trees within Cassandra on the new node: you're restriping the data and rebuilding those Merkle trees. What that translates to is that, with Cassandra itself crippled, your application, which depends on Cassandra, starts to slow down, and it can potentially even grind to a halt. So failure in these NoSQL databases is very expensive. Now, granted, how you deploy your Cassandra cluster matters, and what your replication factor is matters, but in any case the overall performance of your Cassandra cluster can be degraded, depending on how it's configured, and if you're not careful it can actually bring your Cassandra cluster down. There's a really interesting section of that video where he talks about the repair process in Cassandra being so IO-intensive that it actually brings itself down. So what I'm saying is: losing local disk means degraded performance, and if you're not careful, your entire application can become unresponsive.
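For those who haven't operated Cassandra, these are the kinds of standard commands involved in that process. This is a sketch of a typical sequence, and the right procedure depends on your topology and Cassandra version; replacing a dead node, for instance, involves a JVM replace-address option at startup rather than a plain bootstrap.

```
# On the node being (re)built, watch the expensive parts:
nodetool netstats           # streaming progress while data is copied between replicas
nodetool compactionstats    # validation compactions, i.e. the Merkle tree builds

# Repair only this node's primary token ranges to limit the Merkle tree work:
nodetool repair -pr
```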
And the other thing with local disk is that you have a window of vulnerability. If you have a three-node Cassandra cluster and one of your nodes goes out, then for however long it takes to bootstrap and rebuild, you have a window in which another failure on one of the other two nodes could mean losing a lot of data. Like I said, it depends on what your deployment strategy looks like for Cassandra and what replication factor you have. But I liken it to minimizing risk: the last thing you'd ever want to do is run Windows with Internet Explorer and no antivirus and no anti-spyware; that's pretty risky behavior. So the idea is to minimize risk, cover your, you know, CYA: limit risk and make sure your data is always available. And with local disk, because all your data is tied to a given node, migrating from one node to the next incurs that expensive rebuild time.

So that's local disk and how NoSQL databases and key-value stores are affected by day two operations. Here's how external storage can help. If you have a bad disk, the entire compute node fails, or even something as simple as a network partition event, the Cassandra instance on a given node stops being responsive: something's going on, the agent can't talk to the master node. That Cassandra instance gets rescheduled somewhere else, and because we're talking external volumes, the volume travels along with the container. When that Cassandra instance comes back up, the volume gets reattached to the node and it effectively continues where it left off. And like I said, with something like REX-Ray doing the storage orchestration, that functionality is already built in.

Now, if we do move one of the Cassandra instances from one node to the next using external storage, that operation, take the container down, bring it up elsewhere, move the volume over, attach it to the Cassandra instance, takes time. But in that small delta of time, whether it's a minute or two, all you've lost is the new data being funneled into Cassandra during that one or two minutes. And the nice thing is, because all the existing data moved over to the new compute node with the volume, when you do a rebuild and repair, you're only talking about copying those two minutes of data from the other healthy Cassandra nodes. In the diagram, you've got the healthy node on one side, and the middle column represents the delta of data you need to copy over to the node that failed. Once that copy operation completes, and it takes significantly less time, it's insignificant compared to a full rebuild, especially in a data-dense Cassandra cluster. And think about something as simple as a network partition event, which could be transient: maybe you have network connectivity problems that last two minutes, or maybe 15 or even 30 minutes.
The last thing you'd want to do there is a full rebuild of a Cassandra node, copying data all over the place, only for that node to come back online 15 minutes later, especially when a bootstrap and repair operation can take up to 15 days.

So that brings me to the end of my presentation; I'll wrap up quickly so we can take questions. With local storage, you have an availability risk: you deploy an application, all your data is effectively pinned to the storage on that compute node; your host goes down, your data goes down with it. You have a scale limitation: you need more storage than the host has, and, well, there are ways around that, but as soon as you go beyond the physical shelf capacity of that compute node, you're going to run into problems; you have to figure out a way to migrate the data to a new compute node with a new instance and bootstrap your way to more storage. And there are performance characteristics to be aware of: if you want higher performance, run on SSD; if you don't really care, run on rotating disk. But local disk is simple and relatively low cost, and it's very easy to get going on day one. As long as you have enough capacity for your stateful application, maybe that's good enough. Then external storage: if you have a stateful application that needs to migrate from one node to the next because of hardware failure, maintenance, or a network partition event, external storage can enable that, using something like REX-Ray: the container moves from one host to the next, and the volume moves along with it. You can take advantage of things like thin provisioning; you can add more shelves to your storage platform if you're on-prem, or, if you're in the cloud, maybe it's as simple as adding another EBS volume. And just like local attached disk, be aware that the performance characteristics are based on the storage platform itself. And that is it. Thank you, guys. Questions? I need a softball, just lob it in.

Q: Hey, thanks a lot, great talk. With external storage, the data travels over the network, right? So there's a performance hit in terms of latency and bandwidth, and you have to do capacity planning for the network to make sure you can accommodate all that traffic. But also, if we run Cassandra on external storage and we want to scale up the cluster, the rebuild still has to happen, right? So wouldn't it be even worse on external storage, because of that performance impact compared to local disk?

A: Well, in the scale-out case, there are storage platforms out there that are actually very high-performance block storage. I'd encourage you to take a look at ScaleIO, which is a completely software-based storage platform. And like I said, it all depends on the storage platform itself: if you have a very poorly performing storage platform, you're absolutely right. But with something like ScaleIO, an elastic scale-out software platform, the more nodes you have offering up storage, the better: when a client pulls data from the storage platform, it actually stripes the reads across multiple nodes at once.
So you actually gain throughput because you have so many nodes scaled out. But you're absolutely correct: depending on the storage platform you have, if there's a bottleneck, then a scale-out where you're adding a fourth or fifth node can be very expensive. If you have something like ScaleIO providing the storage underneath, though, you'll find the performance is pretty well matched.

Q: Okay, and a follow-up: can someone run something like ScaleIO or Ceph on Mesos itself, using the local disk? Coexisting on the same workers?

A: You can. Actually, if you want to take a look at our team's GitHub page, I wrote a ScaleIO framework that deploys ScaleIO on an existing Mesos cluster. It bootstraps the metadata nodes that manage the data, and it rolls out the other components: ScaleIO has a server component that consumes local attached disk, and a client component that lets you connect and attach storage from ScaleIO. The easiest example is a fully converged storage infrastructure: potentially every compute node that has local attached disk contributes those disks to ScaleIO, which takes the individual disks and makes them look like one ubiquitous piece of storage. And when a client connects and does reads, it's actually reading from multiple places at once, because the data is striped across all of your nodes.

Q: That's pretty cool. Sorry, the last one; I'm just not aware of how the CSI work goes. In that case, if we run the storage platform on the same worker nodes, is there a way to guarantee data locality: to place the container close, maybe on the same rack, maybe on the same node as the actual disk space, to avoid the network traffic?

A: Right, yeah, exactly. Like I said, it depends on your storage platform. If your platform performs better when you land closer to the source, then you'd do something like tag those hosts and say: I want to land on one of these given nodes, because I know the performance of the storage platform there is going to be better than if I land over here. In the ScaleIO case, as an example, because you're striping data from everywhere, effectively one ubiquitous pool of all the disks the compute nodes are contributing to ScaleIO, the cool thing is that it doesn't really matter where you land, because it's all the same.

Q: Hello, my name is Gurbrinda. My question is in continuation of what you were answering: if we have a Cassandra cluster on external storage, is there any specific configuration, in terms of reads and writes, required to make that Cassandra setup successful?

A: Yeah, that's what I was saying. Depending on how many Cassandra instances you have, how many nodes of Cassandra, and what your replication factor is, you can help mitigate some of the challenges associated with that. But like I said, it all depends; I'm not saying external storage or local attached disk or anything is better than the other.
The focus of the talk is really: whichever you pick, make sure you architect your deployment not only for the initial deployment but also for the day two stuff. If you want to run on local disks, make sure you have n nodes to tolerate failure, and make sure whatever replication factor you set complements that deployment.

Q: Okay, so specifically, if I'm looking at Cassandra as a persistent solution, it will work fine on external storage with the few good practices you just highlighted?

A: Yeah, and if you want, we can talk a little after this, because I think we're over time now. If you want to come back up here, we can talk about how different storage platforms lend themselves to running workloads like Cassandra.

All right, I think we're out of time. Cool. Thank you very much. Thank you, guys.