Welcome everyone. My name is Dinesh Sranee and I'm a software engineer at Portworx. Today I'm going to talk about how you can use external persistent volumes to achieve high availability and reduce recovery times when running stateful services on DC/OS. So let's jump right in.

Here are the topics I'm going to cover today. We'll talk about the different kinds of stateful services. Then I'll go through the advantages of using external persistent volumes. I'll also give you an introduction to Portworx and how you can deploy services on DC/OS to take advantage of Portworx volumes. Then I'll do a demo showing how you can install Portworx and Cassandra to use Portworx volumes, and demonstrate failover and some other useful scenarios.

So let's talk about stateful services. There are basically two types of stateful services when it comes to persisting data. The first are simple applications which don't do their own replication. They rely on the underlying storage layer to always be available. In case of failures, the storage layer makes the same volume available on another node so that the application can come up with the same state. The second type is where applications do their own replication across nodes. So if a node dies or fails, there's always another copy of the data in the cluster. If the node that crashed does come back online, the replication takes care of repairing the data it missed back onto that node. This replication, or re-replication, or repair can either be manually triggered or automatic, depending on the application. Examples of the first type are WordPress or MySQL running in simple mode without their own replication, whereas the second type would be Cassandra or HDFS, which replicate across nodes and then do repairs in case of failures.
So now you might be asking why this replication strategy is important. Well, it's because bad things happen all the time, right? Your nodes could crash, your network could have issues. For example, your network could get partitioned so that your nodes lose quorum. Your disk could go down, your nodes could go down, or your entire rack could go down, bringing down multiple nodes.

For applications that do their own replication, there is always another copy on one of the other nodes in the cluster. So you can still continue to serve I/Os, and your application will not be affected by one node being down. And if you have to replace a node, you can just bootstrap it and repair all the necessary data onto it. This can end up taking a lot of time, depending on how much data was on the node that went down. And while you're repairing data to that node, the I/O throughput for that service will drop. For instance, if you had a Cassandra cluster doing three-way replication and one of your nodes goes down, you would only be able to serve reads from two of the nodes instead of three. And while the repair is going on, your network also takes a hit, because all the data needs to be transferred from the two surviving nodes back to the one node that is being brought up.

For non-clustered applications, though, if you had no backup and were using local storage, your application would be doomed: you would not be able to bring it back up until you could move the data disks from the node that went down. And if those disks had physical corruption, you would end up losing your data. So how can external persistent storage help with all of this?
Well, for applications like MySQL and WordPress, which don't do their own replication, it can provide high availability for your services and make sure that downtime is eliminated. And for services that do their own replication, it can reduce recovery times by a large amount, because you will not have to bootstrap data onto a new node. All you have to do is repair data for the time that the node was down before the service came back up on another node. I'll talk a little more about this in the next slides. Another advantage of using an external storage provider like Portworx is that it helps you virtualize your storage, so you can grow your compute and storage needs independently of each other.

Okay, before we talk about the scenarios, I wanted to give a brief introduction to Portworx, because a lot of the scenarios we'll discuss depend on a software storage solution like Portworx already being installed on your cluster. Portworx is the first production-ready software-defined storage designed from the ground up with microservices in mind. Using Portworx, you can provision and manage container-granular virtual devices. And tight integration with schedulers and container orchestrators helps run your workload local to where the storage is provisioned. Apart from this, we also have a couple of other useful features that help with your daily storage management needs. You can take copy-on-write snapshots, which you can restore from in case you have an outage or an issue with the services using your volumes. You can also take CloudSnaps, which back up your entire volumes to an object store outside your cluster; that object store can be any S3-compliant object store, Azure Blob Storage, or Google Cloud Storage. We also allow encryption of volumes at either a cluster level.
So you can have one cluster-wide encryption key, or you can have encryption keys on a per-volume basis. This is very helpful in multi-tenant setups where you don't want to share encryption keys between different volumes and different services. Another feature of Portworx is that everything is API-driven, so everything can be automated. You don't have to manually go in and provision volumes or manage your data; this is written for DevOps from the ground up, so everything is automatable, and you never have to go in and manually do any kind of repair or recovery. We actually run as a container ourselves, which makes Portworx very easy to deploy and maintain. And you must have heard about CSI in the past couple of days: as soon as the scheduler orchestrators have support for CSI, Portworx will support it too.

So this is how Portworx would look once it's deployed. Portworx scans all the block devices, whether they're direct-attached storage devices, EBS volumes, or Azure Managed Disks, and carves them out into one big cluster storage pool. When your pods or containers spin up, they request thinly provisioned volumes from this cluster storage pool. And we replicate the data for these volumes across nodes, so that if one of your nodes goes down, you're still able to bring up the container on another node with the same state. Here, the orange part is what Portworx comprises, and you would run different apps which request volumes from Portworx.

I talked about block-level replication; this is how it works. All writes are synchronously replicated across multiple nodes. And these replicas are accessible not just from the nodes where the data lies, but from any node where Portworx is installed.
So if you want to scale your cluster so that a few nodes have more compute and memory while separate nodes have more storage, you can always start up Portworx on storageless nodes to consume storage from the nodes that have more storage available. In this case, all writes are synchronously replicated onto other nodes, and if a node goes down, you can still use another replica from another node. When the node that went down comes back up, Portworx repairs the data in the background to bring that replica back up to date with the current state of the volume. And if that node goes down permanently, Portworx re-replicates the data onto another node, so that you always have the minimum replication factor that you specified for your volume.

So now let's talk about recovery times with local volumes versus external persistent volumes. This mostly concerns the second type of applications, where applications do their own replication. When tasks use local storage, they are pinned to that node. So with local volumes, if a node crashes, the faster the node comes back online, the better, because then you have less data to repair. But in reality, that is not always the case. You could take nodes down for maintenance cycles, and in that case the period the node is down could be long, so you end up having to repair a large amount of data. And nodes do take some time to come back up if they crashed, so that just extends the time you have to repair data for. But if the node fails permanently, this repair time can be even longer, because you have to re-replicate all your data from one of the other replicas in the cluster onto the replacement node. And this could end up taking a lot of time.
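To make the difference concrete, here is a back-of-envelope sketch. These numbers are not from the talk; the 2 TB replica size, the gigabit of usable replication bandwidth, the 5-minute reschedule window, and the 50 MB/s write rate are all illustrative assumptions. The comparison is between bootstrapping a full replica onto a replacement node and repairing only the writes missed while the task was rescheduled:

```python
# Back-of-envelope comparison of recovery times (illustrative numbers only):
# re-replicating a full local dataset vs. repairing only the writes that
# landed while the task was being rescheduled onto another node.

def transfer_hours(data_gb: float, usable_mbps: float) -> float:
    """Hours needed to move data_gb over a link with usable_mbps throughput."""
    seconds = (data_gb * 8 * 1024) / usable_mbps  # GB -> megabits, then / Mb/s
    return seconds / 3600

# Local volumes, permanent node loss: bootstrap the full 2 TB replica.
full_bootstrap = transfer_hours(data_gb=2048, usable_mbps=1000)

# External volumes: the replica survives, so only repair the delta written
# during a 5-minute reschedule window at 50 MB/s of incoming writes.
delta_gb = 5 * 60 * 50 / 1024  # ~14.6 GB written during the outage
delta_repair = transfer_hours(data_gb=delta_gb, usable_mbps=1000)

print(f"full bootstrap: {full_bootstrap:.1f} h")  # prints "full bootstrap: 4.7 h"
print(f"delta repair:   {delta_repair:.2f} h")    # prints "delta repair:   0.03 h"
```

The exact numbers don't matter; the point is the ratio. The delta repair scales with the length of the outage window, not with the size of the dataset.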
It could range from anywhere between hours and days. And while this is happening, like I mentioned before, the throughput of your service will be affected, and you will also end up using a lot of network bandwidth. This could actually bring down your entire service if you end up using a lot of your throughput just to do the repairs.

Now, let's take a look at the recovery times with external storage. With external volumes, your data is accessible from any node in the cluster. So if a node goes down, you don't have to wait for it to come back up to bring your service back up. All that needs to happen is for your scheduler to schedule that service onto another node and ask for it to use the same volume, and it comes back with the same state. And all you have to do is repair data for the time it took the scheduler to reschedule your pod onto another node. This is similar in the case where a node dies permanently, too: since there's always another replica available in the cluster, you just need to wait for the scheduler to schedule your container onto another node, and again repair the data for the time that the node was down.

Okay, so here are some of the advantages of using Portworx, or any software-defined storage like Portworx, for your solution. Using a SAN or NAS is an anti-pattern in microservices, because you end up losing the flexibility you get from microservices: you're basically siloing your storage out into a storage array that sits outside your current cluster. This also introduces latencies, because all the data written to your storage array needs to traverse from your compute cluster to your storage cluster. And it is also a single point of failure.
So if you lose the link between your compute cluster and your storage cluster, all your stateful services go down at that point. And having an external storage array increases complexity and failure modes when you're dealing with node crashes and other scenarios that might happen in your cluster.

So like I mentioned, Portworx is built for microservices from the ground up. One of the things we do is tight integration with schedulers: we can actually influence schedulers to co-locate your tasks with where the data is located, instead of having them placed anywhere. We don't shard data across all the nodes; we make it so that containers can take advantage of having data local to the node where they're scheduled. Another advantage is that there is a common solution for hybrid deployments. There are a lot of scenarios nowadays where companies have hybrid deployments: they have an on-prem cluster, but they want to burst into the cloud. If you are not using a software-defined storage solution like Portworx, you would need different automation and tools for your on-prem and cloud environments. By having one unified layer for managing your storage, you can eliminate that and build your apps faster, rather than having to worry about managing storage across different deployments.

Okay, so now you might ask: why not just use EBS directly? Like I pointed out, first of all, EBS only works on AWS. The same applies to Azure Managed Disks or Google Persistent Disks, so you would need different automation and tools for your hybrid deployments. Also, EBS has a limit of 16 volumes that you can attach to any EC2 instance. So if you have beefy EC2 instances and you want to run a lot of containers on them, you will end up hitting that limit.
So you will not be able to tightly pack your services onto a node in that case. One of the most common scenarios you run into with EBS, actually, is when you're testing failover: a lot of the time you'll see EBS volumes get stuck in the attaching or detaching state. This, again, is problematic if you want to automate your entire environment, because somebody has to manually go in and detach the volume or attach it to the new node. In fact, I was just talking to somebody earlier this week, and they mentioned that one of their volumes got stuck in a detaching state while they were using a volume plugin, and there was no way out of it. The only thing they could do was delete the EBS volume, and that's a data-loss situation you don't want to get into in production. Another thing with EBS is that the performance is not always up to the mark. You can always pay for provisioned IOPS, but at that point you end up spending a lot more money. And the last thing is that failover is slow. EBS wasn't designed with microservices in mind, so it wasn't designed for scenarios where you fail over your containers or services very often, and it ends up taking a lot of time to fail over your EBS volumes from one node to another. So there's nothing wrong with using EBS as such; you just don't want to be using EBS volumes directly with your containers. You want a layer on top of EBS which manages your container-granular volumes, so that you can easily fail over your containers instead of having to depend on EBS moving your volumes.

So how can you use Portworx with your stateful services? There are three ways you can do this right now. The first is that you can deploy simple services in Marathon using Portworx as the volume driver; for any applications in the DC/OS Universe which allow you to configure an external volume provider,
you can specify Portworx as the driver. The second way is to deploy the services that we've developed based on DC/OS Commons, which are also available in the Universe; I'll go through these in detail later. And we've also modified the DC/OS Commons framework itself to be able to use Portworx volumes, so you can always use that to develop your own services too.

All right, so this is an example of how you can use Marathon with Portworx volumes. It's a simple example of a MySQL container that you would spin up. All you have to do is specify the volume driver as Portworx in the parameters for your container, and in the volume parameter you can specify the size for the volume, the replication factor, the name for the volume, as well as any other parameters you want to use while creating the volume. And you don't need to pre-provision volumes at all in this case; all of this is done dynamically. As soon as you launch this, Mesos tries to spin up a Docker container, we get a request to mount the volume, we see that the volume has not been created, the options get passed to us, and we dynamically create the volume and mount it inside your container. And you can use a similar spec with Docker as well as UCR.

About the DC/OS Commons-based services I mentioned: we basically enhanced the DC/OS Commons framework to work with Portworx volumes. There are four services available through DC/OS Commons this way, and you can obviously write more. The four services are Cassandra, HDFS, Elasticsearch, and Kafka. We've added support for Portworx volumes and submitted these to the Universe.
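Before moving on to the Commons-based services, here is a concrete sketch of the Marathon spec described earlier. The image, volume name, and mount path are hypothetical, and the `size`/`repl`/`name` option string follows the pattern described in the talk, so check the Portworx documentation for the exact syntax your version expects:

```json
{
  "id": "mysql",
  "cpus": 1,
  "mem": 1024,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "mysql:5.7",
      "parameters": [
        { "key": "volume-driver", "value": "pxd" },
        { "key": "volume", "value": "size=10,repl=3,name=mysql_vol:/var/lib/mysql" }
      ]
    }
  },
  "env": { "MYSQL_ROOT_PASSWORD": "password" }
}
```

When Marathon launches this app, the `pxd` driver receives the option string, dynamically creates a 10 GB volume replicated across three nodes, and mounts it at `/var/lib/mysql` inside the container.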
So all you have to do is go to the Universe and search for Portworx, and you'll be able to install these four services with Portworx volumes backing the state. We've actually made a couple more enhancements that allow tasks to fail over between nodes, because there is nothing pinning the tasks to a particular node anymore once Portworx is backing these services. This means that your services will have higher uptime, and your recovery times will be reduced by a large margin. We've also made changes to the framework so that your volumes are co-located with your tasks, which reduces latencies as well as network usage.

So this is an example of a hello-world program. It's available in DC/OS Commons, and the parts I've highlighted are all you would need to change to use Portworx volumes instead of the default ROOT and MOUNT volumes. DC/OS Commons is actually a great way to write stateful services. The only thing is that right now, before they add support for CSI, they only support ROOT and MOUNT disks, which means your services are pinned to a particular node. So if that node goes down, the task is not going to spin up on another node, because its data is available only on the node that went down. Like I mentioned, we've modified it to support Portworx volumes, and all you have to do is change the type of the volume from ROOT or MOUNT to DOCKER, then specify the Docker volume driver as pxd and the Docker volume name, which is the name for the volume. There you can also specify the replication factor you want to use (the size is a separate parameter), whether you want to encrypt volumes, and any other parameters that you would be able to pass to Portworx. So you can write any service as a spec like this and specify the Docker volume driver as pxd to take advantage of Portworx.
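A sketch of what that highlighted change might look like in a dcos-commons ServiceSpec YAML. The stock hello-world fields (`pods`, `tasks`, `volume`, `path`, `type`, `size`) follow the upstream SDK, but the Portworx-specific keys shown here (`docker_volume_driver`, `docker_volume_name`, `docker_driver_options`) are a best-effort reconstruction from the talk, so verify the exact field names against github.com/portworx/dcos-commons:

```yaml
name: "hello-world"
pods:
  hello:
    count: 1
    tasks:
      server:
        goal: RUNNING
        cmd: "echo hello >> hello-container-path/output && sleep 1000"
        cpus: 0.5
        memory: 256
        volume:
          path: "hello-container-path"
          # Stock hello-world uses type ROOT (or MOUNT), which pins the task
          # to one agent; DOCKER hands the volume to a Docker volume driver.
          type: DOCKER
          docker_volume_driver: pxd
          docker_volume_name: hello-data
          # Portworx create options, e.g. the replication factor
          docker_driver_options: "repl=3"
          size: 1024
```

The only Portworx-specific change is in the `volume` stanza; the rest of the spec is whatever the service already defines.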
And again, all of these volumes are dynamically provisioned, so you don't have to go out of band or talk to your storage provider to provision volumes. All of this is taken care of automatically when the service first comes up. The source code for this is available at github.com/portworx/dcos-commons in case you want to take a look.

All right, demo time. What I'm going to show you is this: I'm going to install Portworx on DC/OS, to show you how easy that is, and then I'm going to install Cassandra and go through a couple of scenarios that you could encounter in production. Let me just open it up again, sorry. Yeah, so Portworx, as well as all the other services I mentioned, is available in the Universe, so all you have to do is go and search for Portworx. We're going to select the Portworx service. We have a six-node cluster, which is five private agents and one public agent, and we're going to install Portworx on the five private agents. We specify the management and data interfaces we want to use and just click review and deploy.

What this is going to do is spin up an etcd cluster, which we use to store our control-plane state. It's also going to spin up InfluxDB, which we use to store statistics, and Lighthouse, which is our UI, and finally it's going to install Portworx on all five private agents. I sped up the video a little because it takes around five or six minutes to install, but as you can see, the etcd cluster got installed, the etcd proxy got installed, InfluxDB got installed, and Lighthouse got installed. So what's happening now is that Portworx is getting installed on all the nodes. And if you go to completed, you'll see there are five tasks for the Portworx install that finished.
What I'm doing now is I've SSH'd into one of the private agents, and I'm just going to watch the status from pxctl, which is our CLI, and wait for all the nodes to come up. As you can see, three nodes have come up, and I purposely did not attach a disk to one of the nodes. You can see that all of these automatically get provisioned and added to the cluster, as either a storage node or a storageless node, depending on whether there were block devices attached to that node. I'm just going to pause it here for a second. The cluster is up now, and as you can see, we have five nodes in the cluster: four of them have 400 GB disks attached, whereas one of them is a storageless node.

So now that the Portworx cluster is up, we're going to go ahead and spin up a three-node Cassandra cluster on top of it. Again, the Portworx Cassandra service is available in the Universe. I'm just going to show you that there are no volumes created initially: you just run pxctl volume list, and you can see there are no volumes. So now we go back into the catalog and search for the Portworx Cassandra service. We don't really have to change much, and by default the volume options are not specified because they depend on what you want to provision, but we're going to specify that we want Portworx volumes with a replication factor of three. This is going to create three 10 GB volumes and attach one to each of the Cassandra nodes. So we just have to click review and deploy, then deploy, and we're going to look at the state of the service.
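The checks in the demo so far boil down to a couple of CLI invocations on any agent running Portworx. Treat this as an illustrative session rather than exact, version-independent syntax; the install path and subcommand names are from my recollection of pxctl, so verify them against your Portworx release:

```shell
# Cluster-level view: node count, storage vs. storageless nodes, pool capacity
/opt/pwx/bin/pxctl status

# Volume view: empty before the Cassandra service installs, then three
# 10 GB volumes with a replication factor of 3 once its tasks are up
/opt/pwx/bin/pxctl volume list
```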
Again, I've sped it up a little because it takes time for Cassandra to spin up and pull the artifacts, but as you saw, one of the nodes spun up, and it automatically provisioned a 10 GB volume with a replication factor of three and attached it to the node whose IP ends with 151, which, as we'll see from the DC/OS UI, is where the Cassandra node had spun up. Then the second node comes up, and then the third node comes up. At this point all three volumes have been provisioned. We'll just do a volume list again, and we see that all three volumes have been created with a replication factor of three. All we needed to do was specify the base name for the volume, and it created cassandra-0, cassandra-1, and cassandra-2 for the three nodes.

So now that the cluster is up, one of the scenarios I was talking about is what happens if your node fails. If you didn't have something like Portworx providing storage to your Cassandra cluster, the framework would keep retrying, waiting for the node to come back up, and it would not start the task on another node unless you manually went in and said that you want to replace the node. And if you replaced the node, you would need to run bootstrap and repair commands again, which could be expensive. Here, since we have replicated the data across nodes, something different happens as soon as we kill the node. I'm going to go ahead and kill one of the nodes where a task is running; I'm just going to power it off. What we're going to see is that the framework realizes there is nothing pinning the task to that node. As you can see, DC/OS has already figured out that the node is offline, and the framework is also going to realize that the node has gone offline. All the task requires from a node is CPU, memory, and disk, and its data is no longer pinned to any particular node.
So it's going to go ahead and spin up that same Cassandra node onto another node, the one ending with 131, which is agent A3. So I'm just going to go into A3 and run pxctl volume list, which shows us that the volume has now been attached to this node. And if you check pxctl status, you can see the node we actually took down. Doing a volume list, you'll see that cassandra-2, the volume for the task that we had killed, is now attached to the new node.

Now, one more thing you must have noticed is that I allocated only 10 GB of data to each of these nodes. If you were running a production cluster and you spun this up and handed it to the production team, you would soon realize that this is not enough. If you were not using something like Portworx underneath, you would need to spin up an entire new cluster and allocate more space. But with Portworx, all you really need to do is the following. I'm showing the output of df here, and you can see that the volume for cassandra-0, mounted under the Portworx mount path, has a size of 9.8 GB, roughly 10 GB. What we're going to do is dynamically resize this volume, so you don't need to take your application or service offline. All you need to do is run one simple command, and it automatically resizes your volume. All these commands that I'm showing use the CLI just to make things clear, but they all use the same REST API interface that you can automate against. So you can list your volumes, provision your volumes, and update the size of your volumes, and all of that can be driven through APIs. At this point, all an admin or a DevOps person would need to do is run a command or use the REST API to update the volume size.
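The resize just described would be a single pxctl call (or the equivalent REST request). The volume name here is from the demo, and the flag syntax is my recollection of the CLI, so verify it against your pxctl version:

```shell
# Grow cassandra-0 to 100 GB; the filesystem is resized online, so the
# Cassandra task keeps serving I/O while this runs
/opt/pwx/bin/pxctl volume update cassandra-0 --size=100
```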
So I'm just going to update the size to 100 GB. This takes just a few seconds, and if I check df again, you'll see that this is now a 100 GB volume. And this is available to your application right away, so you don't have to take it down or do any kind of maintenance for this. I think I lost my slides. All right, so that's the end of the presentation. Any questions?

We're actually evaluating Portworx to be production ready inside our company, and we hit a chicken-and-egg problem. It is the following. We're using the DC/OS Marathon pxd driver, right? And we wrote some disaster-recovery tests, basically killing a node and checking whether it successfully rejoins the cluster. What actually happens is that dockerd, after a reboot, gets stuck for 12 to 15 minutes, logging that the pxd driver is inaccessible. After some internal timeout it proceeds and then picks it up. And the reason for that is that Portworx itself is inside a container, right? So it's really a chicken-and-egg problem. What would you suggest to work around it?

Yes, so we have encountered that problem. That is actually a limitation with Docker. The way Docker works is that every time it starts up, it queries all the volumes it knew about, and it tries to talk to the plugin to figure out the state of each volume. The way we've worked around this is that we're planning to roll out Portworx so that you install it as a runC container instead of deploying it as a Docker container. Then there would be no dependency between Docker and Portworx: you could kill Docker, but Portworx would still be up and running, and Docker would query Portworx for all the volumes and not block at that point. Yeah. Any other questions?

With the block-level replication, what kind of speed hit would we be seeing with Portworx?
Because it's having to send the data out over the network, what kind of performance would we be looking at when replicating at the block layer with, say, three replicas?

With anything greater than one replica, there will be a slight performance hit, because obviously you're sending packets across the network, and what we've seen is that the hit is between 3 and 5%. So there is a small hit, but at that cost you're getting the added advantage of high availability and all the other features that the storage layer provides. All right, any other questions?

Hi, thank you for the presentation. I'm just trying to get familiar with some of the concepts you introduced. For example, I wasn't aware of some of the components I saw in the list, etcd among them, and I started wondering about data persistence: what is actually needed by Portworx to operate, and what would happen if we lost, for example, etcd?

Good question. So the way it is in the Universe, we made it very simple for somebody to just install and try out Portworx. For production, though, we suggest that you install etcd independently of Mesos or DC/OS, because in that case, if you have to reinstall Mesos or DC/OS, your Portworx installation will not be affected at all. In this demo setup, if you had to reinstall Mesos or DC/OS, since etcd is using local volumes, you would not get back the same offers and you would not be able to bring back etcd with the same state. So in production, we actually suggest you have an external etcd. But the point here is to demonstrate how easy it is, in case somebody just wants to try it out really quickly. And etcd is just used to store the metadata for our control plane; there's no actual data stored there. It basically stores information on what volumes have been provisioned and such.
And we are actually working on a way so that if you do lose your etcd, you will still be able to reconstruct it by looking at all the block devices in your cluster. So we are working towards having that.

Hi, I saw that you have to launch another framework, different from the Mesosphere framework, in order to launch Cassandra, I assume with Portworx. I assume it's the same with HDFS, isn't it?

Yeah, so all these frameworks are based on DC/OS Commons, which is an SDK to launch and manage your stateful services. The only thing there is that it had support only for local ROOT and MOUNT disks. So we've modified that framework to be able to support Portworx volumes, and we've changed the parameters that you pass in the YAML spec so that services can use Portworx volumes. So it's basically one SDK, and we've built these four services on top of that SDK.

I think you answered my question, but all the functionality that's in the base framework is still in your framework, isn't it? Like Kerberos for HDFS or TLS for Cassandra.

Yes, all the support that's there in the base DC/OS Commons is also in this. Yes. Okay, thanks. Any other questions?

All right, so you can always visit us at our booth if you have any more questions, and you can always go to our website too. Like I mentioned, all our services are in the catalog, so you can always just search for Portworx and download and install them. You can also visit our docs website if you want to get started. We have a free PX-Developer version, and we also have a PX-Enterprise version, which has all the features enabled and is free to try for 30 days. You can contact us at info@portworx.com for more information. Thank you. Thank you.