Testing, testing. Can you hear me? All right, perfect. Okay, we're going to get started. This talk is on how container schedulers and software-defined storage will change the cloud. Our speaker is David vonThenen. He's a software engineer with Dell EMC. In a nutshell, this talk is about having HA containers in the cloud, high availability in the cloud, an environment where your apps can repair themselves. So without further ado, I give you David.

All right, thanks, guys. Thanks for showing up. So yeah, we're going to talk about container schedulers, we're going to talk about software-defined storage, and we're going to talk about what it means for you when you take those two things up into the cloud. I'm already into the intro: I work for Dell EMC on the {code} team, which is a group that specializes in open-source software. We do various storage integrations in the container space, everything from Docker to Mesos to Kubernetes. Prior to this, I worked in backup and recovery for virtualized environments, VMware in particular, so doing VMware backup. My name is up there, code by Dell EMC, my Twitter handle, and my blog, in case you want to check them out.

The agenda: first, a really quick review of software-defined storage and what that actually means, because if you go looking on the Internet there are about a hundred different definitions of what it is. I just want to touch on the things that are common among those definitions. Then I want to talk about container schedulers, first at a high level, and then I'll go into the weeds a little to talk about some of the offerings that are available today. Then we'll take a look at what happens when you combine these two: you take software-defined storage, you take container schedulers, what can you do when you tie them together? And ultimately, the whole idea of the talk is what happens when you take this stuff into the cloud. And today, apparently, I'm crazy enough that I'm going to do the demo live, so hopefully the demo powers that be will smile on me and it will all work.

So, software-defined storage. Like I said, many definitions, but there are two main things most people agree on. First, software-defined storage usually serves as an abstraction layer on top of the underlying storage. You've seen this with volume managers, where you have the idea of logical volumes and physical volumes; it's that abstraction layer between the physical storage and what's offered up to the end user. Second, software-defined storage provides a programmatic mechanism to provision storage. If you need to allocate more storage, or run a maintenance operation, or do some other operation on your software-defined storage platform, there are APIs available to do that. A couple of examples of what software-defined storage looks like: NFS is a very simple example of that.
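As a rough sketch of what that "programmatic mechanism to provision storage" looks like from client code, here is a minimal, purely illustrative Go interface. The type and method names are invented for this example and are not any particular platform's API:

```go
package sds

import "fmt"

// VolumeProvisioner is a hypothetical, illustrative interface for the kind of
// programmatic control a software-defined storage platform exposes.
type VolumeProvisioner interface {
	// CreateVolume carves a logical volume of the given size (GiB) out of the
	// underlying pool and returns an identifier the consumer can attach by.
	CreateVolume(name string, sizeGiB int) (volumeID string, err error)
	// AttachVolume makes the logical volume visible to a specific host.
	AttachVolume(volumeID, hostID string) error
	// PoolUsedPercent reports how full the backing pool is, which is what
	// lets tooling automate policy ("add capacity when we pass 90%").
	PoolUsedPercent() (float64, error)
}

// provisionForApp shows the flow an operator, or an automated framework,
// would script instead of clicking through a storage array UI.
func provisionForApp(p VolumeProvisioner, host string) error {
	id, err := p.CreateVolume("postgres-data", 16)
	if err != nil {
		return fmt.Errorf("create: %w", err)
	}
	if err := p.AttachVolume(id, host); err != nil {
		return fmt.Errorf("attach: %w", err)
	}
	used, err := p.PoolUsedPercent()
	if err != nil {
		return err
	}
	fmt.Printf("volume %s attached to %s, pool is %.1f%% used\n", id, host, used)
	return nil
}
```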
And then on the further extreme there's VMware vSAN. I don't know how many VMware users are here or how many people are familiar with VMware, but vSAN is a good example of software-defined storage. So what makes software-defined storage unique? I already listed some of this, but I want to dive into it a little more. Operationally, you manage provisioning of the storage, and how that data is presented to the end user, independent of the underlying hardware. I already mentioned that, but I really want to make that point clear. Second, on the physical side, you have an abstraction layer that separates what's offered up to the end user from what's actually attached to those systems. As an example with NFS, the physical disks could just be normal rotating media, but what you expose to the end user is simply an NFS share. And third, policy. This is a newer thing we'll throw in there: with software-defined storage there's usually automation of policy, driven both by the external user and by the internal storage platform. What does that mean? Say I'm a user and I want some storage; you can classify storage into buckets like gold, silver, and bronze. Maybe gold is your SSDs and bronze is your rotating disks. So there's policy associated with the varying levels of storage that are available. And something else that's different, because we're talking about software and not large racks full of rotating disks, is day-two operations: what happens when you need to perform maintenance on these storage systems? It's inherently different. Ultimately, diving into the details, you still have disks out there, but because it's all software-based, how you perform maintenance is inherently different from doing a rebuild on a RAID array.

I'm a very visual person, so this is effectively what NFS looks like. We have a host with two physical disks at the bottom. On top of those two physical disks, which might be combined as an LVM group, you have the software-defined storage layer, which is just what you get if you stand up an NFS server, an apt-get install or yum install of NFS. From that you expose a volume, or some entry point into the NFS server, which gets mounted onto other systems, and that's how the end user consumes the storage.

Since I mentioned vSAN, this is the far extreme, a slightly more complex example of software-defined storage. In the VMware vSAN world you need a minimum of three nodes. You have rotating disks at the bottom, and SSDs sitting above them that act as a write-cache layer, so when you want to write to this particular vSAN, the writes land there first and the rotating disks get enough time to catch up and persist the data onto the physical rotating media.
And the vSAN layer takes those three servers and all of that storage and makes it look like one globally accessible pool that straddles all three hosts. When you provision vSAN volumes, they come out of that globally accessible pool and get presented to your hosts. So that's a slightly more complex example of what a software-defined storage setup looks like.

One thing I really want to highlight, and the reason I picked NFS and vSAN, is what makes them special: they're both purely software-based storage platforms. You don't need any kind of specialized hardware, no purpose-built appliance, no storage array, no rack with a storage controller in it. It's literally software you can download. In the case of NFS, you can apt-get install it. In the case of VMware, it's slightly more complicated, but it's in your ESXi installation media; if you have the licensing for it, you don't need any special hardware, just your systems with their SSDs and rotating disks, and you can create yourself a vSAN cluster. So that's a very high-level overview of storage; I wanted to go through it quickly just to level-set on the idea of software-defined storage.

Next, we're going to take a look at container schedulers. Who here has used Mesos, Swarm, or Kubernetes? Show of hands. Okay, so a few people. I'm going to define what a container scheduler is for the people who haven't used any of those. A scheduler effectively takes a container and gives it a fair and efficient placement within your compute cluster, and you typically want that placement to adhere to a certain set of constraints. For example, if you deploy three instances of a particular container, you don't want all three of those instances starting on the same piece of compute. Ideally, if you want some form of high availability, you tell the scheduler: when you start these containers, put them on three different compute nodes, so that when one goes down you don't lose your entire application. Schedulers also give you the ability to quickly and deterministically dispatch jobs. What I mean by that is, if you have an application with 10 instances of a container running and you want to scale it up to 100 or 1,000 instances, you want to know how long that will take until your application is fully scaled out and fully available. And you want your container scheduler to be robust and tolerate errors. Any time you deploy into production, if a piece of compute goes offline because of a hardware failure, you want the containers on that host, which are simply lost, to get moved over to a different piece of compute, so your application comes back up and is available again. So when I say containers, what types of containers am I talking about?
I'm talking about containers in the sense of Docker, which you've been hearing about here at the conference over the past couple of days. Mesos is another containerizer: Apache Mesos is an open-source project that does container scheduling, and funny enough, Mesos supports Docker as well. And then there's also rkt, so go find the CoreOS folks at their booth if you want to talk and find out more from them. Those are the types of containers I'm talking about.

When you schedule work, the container scheduler is also effectively a cluster manager. If a node in your cluster fails, you want the tasks on it to be moved to a different node, and that's exactly what a cluster manager does. And when those workloads get placed on your compute, you want the tasks placed based on resources. With Apache Mesos, for example, when you define a job or task or application to run within your cluster, the three most common criteria are: this container requires X CPUs, this much memory, say two gigabytes, and this much disk. Tasks are launched on compute based on the resources available in your cluster. Then there are operational constraints. I already used the example of launching three instances and wanting them on three different pieces of compute. Here's another: maybe this application needs a higher level of availability, and instead of three different nodes, you want the three instances in three completely different racks. You can do things like that with container schedulers.

A lot of the modern container schedulers have some pretty cool features, and one of them is the ability to create your own custom scheduler. If you have a particular application or workload with intricacies that the default scheduler can't handle, you can create your own scheduler: it takes the resources that are available, figures out what placement is best for your particular application, and decides where those container workloads should land. Some examples of what you could customize: runtime availability, going back to having your application installed across multiple racks, maybe three instances in each rack or each data center row. Maybe the application has built-in real-time fault tolerance, where when you lose one node the next node picks up and continues on. Or maybe you have a particular application that does graphics processing,
and within your cluster you have GPUs that are highly effective at processing graphics, movies, and audio/video decoding, and you want those workloads to land on the nodes that are better suited to that particular application. You have the ability to customize that. I've got the cool little guy here with the different shirts: customization is good.

I want to take a look at one particular container scheduler, because there are a lot of them out there, and oddly enough it's the one I'm most familiar with: Apache Mesos. Like I said, it's a container scheduler and an open-source project. It supports Docker workloads, and Mesos also has its own containerizer that's a lighter-weight version of a container. It places and deploys your containers based on CPU, memory, and disk, the three criteria I mentioned, and it also lets you define your own constraints, like placing these two instances in a particular rack. I have a list of logos here: Apache Mesos isn't new. It's been around for quite some time and it's in production at all of these companies, so if you use any of these services today, you've been using Mesos as an end user, because these companies run installations at the scale of thousands of nodes.

So why did I pick Apache Mesos to talk about? Mesos has this cool technology called Mesos frameworks that lets you customize the placement of your Docker or Mesos containers based on what your application needs. The way it's implemented, and now we're diving into the weeds of how a framework operates, is that a framework implements two things: a scheduler and an executor. The scheduler receives resource offers, and this custom scheduler for your application says: I'm going to deploy this particular application to this node in the form of an executor, which is effectively a container that consumes some CPU, memory, and disk and deploys your application. In the Mesos world they call this the offer/accept model. There's a cool little animation: memory gets offered up, you get an application back that uses X amount of memory, and anything you're not using gets returned to the pool. And this framework is custom: if you created it for your application, it's specific to your application. You could have many applications running in Mesos, each with its own framework, or maybe the default scheduling mechanism is sufficient and you can just live with that. But the cool thing is that it allows you to completely customize how your application gets deployed in the environment.

This diagram shows the process I just described in a little more detail. I'm going to post the slides later; they're probably up on the SCALE website already. Resources get offered up to the control plane, the framework gets the resources, decides to deploy an application, and sends it down to the compute. One, two, three, four.
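As a rough illustration of that offer/accept loop, here is a minimal Go sketch of a custom scheduler's decision function. The types and method names are simplified stand-ins invented for this example; they are not the actual mesos-go API, just the shape of the idea:

```go
package framework

// Offer and Task are simplified stand-ins for the resource offers and task
// descriptions a real Mesos scheduler exchanges with the master.
type Offer struct {
	AgentID string
	CPUs    float64
	MemMB   float64
	DiskMB  float64
}

type Task struct {
	Name  string
	CPUs  float64
	MemMB float64
}

// Driver abstracts the two answers a scheduler can give to an offer:
// accept it (launch tasks against it) or decline it so the resources
// return to the pool for other frameworks.
type Driver interface {
	Accept(offer Offer, tasks []Task) error
	Decline(offer Offer) error
}

// ResourceOffers is the heart of the offer/accept model: the master hands the
// framework a batch of offers, and the framework decides what, if anything,
// to launch on each one.
func ResourceOffers(d Driver, offers []Offer, pending []Task) []Task {
	for _, offer := range offers {
		if len(pending) == 0 {
			_ = d.Decline(offer) // nothing to run; give the resources back
			continue
		}
		next := pending[0]
		if next.CPUs <= offer.CPUs && next.MemMB <= offer.MemMB {
			// This offer is big enough; launch the task (the executor) here.
			_ = d.Accept(offer, []Task{next})
			pending = pending[1:]
		} else {
			_ = d.Decline(offer)
		}
	}
	return pending // tasks still waiting for a suitable offer
}
```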
So now we've talked about software-defined storage and we've talked about container schedulers, and there's a lot of information in there already. Now we're going to talk about what happens when we take these two independent pieces and bring them together. I had this idea back in September 2016: what happens if I create a software-defined storage framework for Mesos? What does that even mean? So I went off and picked a particular piece of software-defined storage, in this case ScaleIO, and combined it with the technology of Mesos frameworks. And it opened up a whole new way of thinking about applications, how they get deployed, and how you ultimately consume them. It was released in September 2016, and we're now on version 0.3.1. It's open source, and the GitHub link is there, so you can take a look at the code, contribute, or open issues if you find any. But it's this idea of combining two different things that are better together. I've got the Oreo cookie and the milk; I don't know anybody who hates that combination.

As I said, ScaleIO is a software-defined storage platform, and I'm only going to spend one slide on it, because ultimately this can be done with any software-based storage platform; I just chose ScaleIO in particular. It's a software-based platform with scale-out block storage. What's cool about ScaleIO is that if you need more capacity, you can either add more disks to the ScaleIO nodes that are there or add more nodes, and it scales linearly: as you add nodes, the IOPS and performance grow along with them. That gets back to the whole idea of an elastic architecture: add drives, add more nodes. And if a particular node goes down, ScaleIO, because it's a software-based storage platform, will rebalance itself. That's why I hinted earlier that when you have a failure, or a change in your software-based storage platform, the maintenance operations change; in this case there's nothing the end user really needs to do, it basically fixes itself when a node fails. The other cool thing, because it's software-based and doesn't require any specialized hardware, is that you can download ScaleIO as RPMs and DEB files and install it anywhere. It's infrastructure agnostic. If you want to install it locally on physical bare metal, go ahead. On virtual machines, go forward. Or take it into the cloud, like AWS or GCE. You can go download it and give it a try. That's all I'm going to say about that.

So what does this software-defined storage framework really mean? The idea of a framework is to install an application, finding the best place and the best mechanism to install it such that it performs as optimally as it possibly can. And if you have a framework that installs a software-defined storage platform, it manages the installation and the configuration of that platform.
So effectively what you do is deploy the framework across all of your compute. It installs the software-based storage platform, and if you do that from day one, as soon as you stand up your Mesos cluster, you have persistent storage access that's native on every piece of your compute, without having to know anything about ScaleIO. That's the cool idea. And because ScaleIO is a software-based storage platform, it provides global access to external storage: if you need to provision a volume, you can create it from any one of these nodes, and no matter where your container workload lands, because the platform exists on every node, you're basically guaranteed access to those volumes. And because it's all software-based, like I said, if you have a node failure and need to rebalance the data, there's no storage array; it all exists in software, using the local disks attached to your individual pieces of compute to make them look like one huge, globally accessible pool of storage. And because it's infrastructure agnostic, you can deploy this thing literally anywhere.

So why is that important? Why do we even care? If you look at containers today, a lot of the workloads being run in container form are long-running, persistent applications. They require state, they require data to be saved somewhere; even something as simple as configuration is state. On your right-hand side is a snapshot of Docker Hub, the top 12 applications you see there, and seven of those 12 are actually persistent applications. I know it's hard to see, but you have Mongo, MySQL, Postgres, Elasticsearch, and even WordPress to a certain degree: you write an article, and that article gets created and saved somewhere. These are persistent applications running in containers today.

So if I have one of these persistent applications, like Postgres, and I start it up and need to save that data, what happens if I don't save it somewhere else? When I shut Postgres down, the container dies, and where does all my database data go? A lot of the container schedulers decided to handle this by saying: I recognize this thing is landing on some form of compute, this compute has access to local disks, so I'll expose a folder on that local disk to the container and you can save your data there. So you bring up Postgres, you start writing to this specially carved-out directory that lives on the local disks of that particular piece of compute, and you're all happy. You stop your Postgres container, bring it back up on that same node, it recognizes that the mount is there, and you have your data; no data loss. Now, the problem is: what happens when you kill that node?
What happens if that node, for whatever reason, has a hardware failure, goes out for maintenance, or has some issue where you can't get at that local disk anymore? If you're trying to run like that without any form of high availability, which is effectively what we're talking about, being able to take that container, move it from one host to the next, and still have your data, you're probably not going to run it in production, because all of a sudden the finance database is gone and you're trying to recover it, and that's not a good story for your higher-ups and your bosses.

So what's the solution to that problem? Like I already said, you want that data to live external to that piece of compute, and that's where ScaleIO comes in. You have a software-defined storage platform where a volume is globally accessible on every piece of your compute, so if you need to bring that container up on a different node, because you have access to the storage platform everywhere, you can reattach the volume it was using on that different node and you still have all your data. And that's the value-add: why software-defined, or software-based, storage platforms matter for containers in production.

Now, a couple of things about this framework. It installs your storage platform, but that alone doesn't hook anything up or provision storage. The framework installs the storage platform and you can run your containers, but how do you connect a container to a volume provisioned from the software-based storage platform (say that ten times fast)? How do you take that volume and connect it to the container? The framework also leverages two other open-source projects. The first is called REX-Ray. I'm using it here for a special purpose, but in the grand scheme of things REX-Ray is a vendor-agnostic storage orchestration engine that supports many different storage platform backends. A fancy way of saying: if I'm in AWS and I need to provision an EBS volume for a container, or really for anything running on an EC2 instance, I can use REX-Ray to create that EBS volume, attach it, and mount it wherever I need. In this case, I'm using REX-Ray to create the volume and attach it to that container instance, so that when the application saves data, it's saving to a volume on the software-based storage platform. The second project is mesos-module-dvdi. REX-Ray is effectively a Docker volume driver plugin, which satisfies the Docker use cases. mesos-module-dvdi is essentially a library that hooks into REX-Ray, one level of indirection, and calls REX-Ray to create and attach volumes for non-Docker workloads in Mesos. So if you're running a Mesos container and you want external storage for that particular application, it will provision the volume and attach it to that non-Docker workload. Both of these projects are open source.
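To make the wiring concrete, here is a rough sketch of launching a Postgres container through Marathon's REST API with its data directory backed by a volume handled by a REX-Ray-style Docker volume driver. The /v2/apps endpoint and the constraints syntax are Marathon's; the hostname, volume name, and exact Docker parameters below are illustrative and depend on how REX-Ray is configured on your agents:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical Marathon endpoint; adjust host and port for your cluster.
	marathonURL := "http://marathon.example.com:8080/v2/apps"

	// One Postgres instance, with a spread constraint for illustration, whose
	// data directory is backed by a volume created through the "rexray"
	// Docker volume driver rather than the agent's local disk.
	appDef := []byte(`{
	  "id": "/postgres",
	  "instances": 1,
	  "cpus": 1,
	  "mem": 1024,
	  "constraints": [["hostname", "UNIQUE"]],
	  "container": {
	    "type": "DOCKER",
	    "docker": {
	      "image": "postgres:9.6",
	      "parameters": [
	        {"key": "volume-driver", "value": "rexray"},
	        {"key": "volume", "value": "postgres-data:/var/lib/postgresql/data"}
	      ]
	    }
	  }
	}`)

	// Equivalent to the curl shown later in the demo: POST the app definition
	// to Marathon and let the scheduler place it.
	resp, err := http.Post(marathonURL, "application/json", bytes.NewReader(appDef))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("Marathon responded with:", resp.Status)
}
```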
Like I said, I'm using REX-Ray for something very specific, but it supports AWS, GCE, DigitalOcean, Rackspace, VirtualBox if you just want to do local testing, and, because we are Dell EMC, a lot of Dell EMC storage arrays.

So we now have the ability to create volumes and move them around in our Mesos cluster. What does that really mean for your application? If you have a node failure in your Mesos cluster, you effectively have high availability for your applications. That workload, whether it's Docker- or Mesos-based, will move from one node to the other, the software-based storage volume will follow the workload to whatever piece of compute it lands on, and it will have access to its data.

But a little more important than that, why is this framework idea a good thing? Two big reasons. One, this framework manages your software-based storage platform, which means it's the one dealing with all the APIs for the storage platform, and it's also dealing with all the interactions with your container scheduler. It makes the API calls to Mesos so you don't have to, and it makes the API calls to install and configure your storage platform. And in the end, as the end user, you really don't care. You deploy this framework, you get a software-defined storage platform, and you don't have to know how or why or what's required. There are actually three metadata nodes that get installed so they can do arbitration if one drops, but as an end user you don't care; you just want the storage platform available for your applications so you can provision volumes from it, and that's effectively what this framework does. That, in turn, gives your containers production-level readiness, so you can put them into production and, should something go wrong, everything's golden.

So I've been building up this story, and the last section is: what happens when I take this into the cloud? If I take this into AWS, for example, because that's what I implemented against, what does that mean? We already talked about how software-defined storage, by definition, has APIs: the ability to make API calls, manage it, do maintenance operations and so on. And we all know one of the reasons the cloud is so awesome is that it also has APIs, and DevOps processes use those APIs to stand up all these EC2 instances and so on. So what makes all this stuff work, the whole crux of why it's important, is the APIs that are available: the programmatic interfaces that let you create applications, manage your application, and manage the cloud. If you have an application with APIs where you can deploy and configure it, make API calls to monitor it, and actually determine its health, then when something starts to go south, when there's some fault in my application, I might actually be able to do something about it now.
And so the idea was: if you have an API where you can monitor your application, your application can actually start to fix itself. I have this cool picture from Short Circuit, Johnny 5 is alive: the application starts to pick up on its own problems and fix them. But more importantly, if we have an application that can determine it has a problem, and we're living in the cloud now, in this example AWS, with our software-based storage platform and our Mesos cluster in the cloud, and AWS SDK bindings are available in ten languages, and we know we can manipulate AWS using those APIs, then you can have applications that start changing the environment they live in. I thought that was a cool idea: if you need to do some sort of maintenance operation or some sort of remediation, the application is actually aware of the environment it lives in. And obviously this is a gross exaggeration, the self-aware application, Skynet kind of stuff; I only implemented a small piece, so we don't need to start hunkering down and building bunkers or anything, but it's just a cool idea: frameworks that can monitor themselves and self-remediate the software-based storage platform.

So what I'm going to demo in a second is this: the software-based storage platform has a storage pool that's approaching full. It identifies that we're consuming all the storage and there's almost none left. The application itself, the framework, identifies that problem. Since we're living in AWS, it goes ahead and creates EBS volumes, adds those EBS volumes to the software-based storage platform, and we go from almost full to okay. That's what I'm going to demo here, and hopefully it works. The slides will be available, with more information about the configuration and all the open-source software being used. I just want to go over what this looks like on the backend, because it happens very quickly, and that's actually a good thing for you as the end user: the more details you have to watch going back and forth, the more things you have to be aware of. The whole process starts with a simple curl command to install this software-based storage platform, and I'm going to do that today. We install this framework; like I said, it's broken into two components. The scheduler gets created, the nodes offer up their resources, and the scheduler says: I can use those resources, and I'm going to install the software-based storage platform, plus REX-Ray and mesos-module-dvdi, the two glue components that hook the storage up to the container. Then we're going to simulate pumping a whole bunch of data into the storage platform. We'll get a warning saying we need to seriously take a look at this because it's about to go south. And then, effectively, the scheduler is going to say: I'm going to make an API call to Amazon, provision volumes, attach them to the storage platform, and everything will be fine. So hopefully this works.
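As a minimal sketch of that remediation step, here is roughly what "watch the pool, and when it crosses the threshold, add EBS capacity" looks like using the AWS SDK for Go (aws-sdk-go). The threshold, region, sizes, device names, and instance IDs are illustrative, and the real framework's internals may differ:

```go
package remediate

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// ExpandPoolIfNeeded captures the "self-remediating" idea: when the storage
// pool crosses a threshold, create new EBS volumes and attach them to the
// storage nodes so the pool can absorb them. usedPct would come from the
// storage platform's own monitoring API.
func ExpandPoolIfNeeded(usedPct float64, az string, instanceIDs []string) error {
	const threshold = 90.0 // arbitrary trigger, same as in the demo
	if usedPct < threshold {
		return nil // pool is healthy; nothing to do
	}

	sess, err := session.NewSession(&aws.Config{Region: aws.String("us-west-2")})
	if err != nil {
		return err
	}
	svc := ec2.New(sess)

	for i, id := range instanceIDs {
		// One 1 TiB gp2 volume per node, mirroring the demo's three 1 TB volumes.
		vol, err := svc.CreateVolume(&ec2.CreateVolumeInput{
			AvailabilityZone: aws.String(az),
			Size:             aws.Int64(1024),
			VolumeType:       aws.String("gp2"),
		})
		if err != nil {
			return fmt.Errorf("create volume for %s: %w", id, err)
		}

		// Attach it to the EC2 instance; a real implementation would first wait
		// for the volume to become available. The device name is illustrative.
		_, err = svc.AttachVolume(&ec2.AttachVolumeInput{
			Device:     aws.String(fmt.Sprintf("/dev/xvd%c", 'f'+i)),
			InstanceId: aws.String(id),
			VolumeId:   vol.VolumeId,
		})
		if err != nil {
			return fmt.Errorf("attach volume to %s: %w", id, err)
		}
		// The framework would then tell the storage platform, via its own API,
		// to add the new device to the storage pool.
	}
	return nil
}
```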
This would be really bad if it didn't work. Let me make this bigger so you can see it. Before we kick it off, let's take a quick look at what we're looking at. This is the Marathon UI. Marathon is what you use on Mesos to launch an application, and the funny thing is that Marathon itself is a framework; maybe that's too much information, but we're going to launch a framework using a framework. This is the UI we're going to see, and we'll come back to it in a bit because it's how we can watch the progress of our framework. I've copied this curl command here, and I'm just going to curl that to Marathon; it's effectively a Marathon API call that says: start the ScaleIO framework. And that's a bunch of nonsense output, don't worry about it. Please work.

So, the ScaleIO scheduler. We're going to take a look at the user interface for this particular scheduler, for this framework. We have three nodes in AWS, representing our compute nodes, and it's installing the ScaleIO packages right now. It's creating the ScaleIO cluster and initializing it. Please work. It takes a little bit of time because it's creating the cluster: it's taking the three nodes that are there, standing up the metadata instances, and creating a three-node high-availability cluster. I have three 180-gig EBS disks attached to each node, and those disks get contributed to the storage platform. Now we're installing REX-Ray, one of the glue components we need to provision volumes for your containers. ScaleIO is actually already up and running at this point, because we're only installing the glue, so as REX-Ray is getting installed we can already see some capacity available. And after REX-Ray is installed, we can see that ScaleIO is running.

So how cool is that? I curled one command, I didn't even need to know what was actually happening under the covers, and at this point, if I wanted to create a container like Postgres and save its data into ScaleIO, no matter what node it gets placed on, because ScaleIO is running on every node, it will provision that volume. As an end user you just think of it as a service, because it's running in the background and you didn't really need to do anything to start it. So that's pretty cool: I have a software-based storage platform running on my Mesos cluster, I have some capacity, and obviously, since it's new, none of that storage is currently being used, so that's the zero percent.

And I created an arbitrary threshold: when the ScaleIO storage pool passes 90% capacity used, that triggers the "oh my gosh, we need to allocate more storage to this platform" response. To simulate that, I was originally planning to actually dump a whole bunch of data into it, but that took a little too long. So what I did was create a little API call that says: pretend I just dumped that much data, so we don't have to sit here for the next five minutes, and I do appreciate your time. So this represents the UI for this particular framework, and I made it so the REST API call I'm going to make lives on that same web server. I'm going to open Postman here.
I copied the URL where the REST API sits, the header is the normal JSON stuff, and the body effectively says: create this much fake data, this amount right here. And I'm going to call the API, cross my fingers. We can already see the fake data show up, because it's not actually generating data; the used fake data is here, and we're at 98% of used capacity. So we've passed that 90% threshold. Since we're in Amazon... oh, it's actually creating them, I kind of missed it, it happened a little too fast, but we're creating, right here, this last one-terabyte volume. We created three of them, one EBS volume to attach to each node, to allocate more capacity to this storage platform. It should be finished pretty soon... now it says it's in use. So cross your fingers, hopefully this works.

What we should see is this: we created three one-terabyte volumes, and what was there before was only three 180-gig EBS volumes, so we're adding a whole bunch of storage to this pool. Once the EBS volumes get added to the storage platform, the backend capacity for the storage pool should increase dramatically, because we're adding effectively three terabytes, and then the percent used should drop, because there's now more capacity available and the fake amount of data we've used is a far smaller percentage.

Right, so the framework. Yes, that Mesos framework is the one actually making the calls. For this particular software-based storage platform, the framework is tightly coupled with the application, the storage platform. Because it knows about this storage platform, it knows how to monitor its state, and it recognizes through the storage platform's API calls that the storage pool is approaching full. When it does, the framework goes out and makes the AWS calls to create these EBS volumes. And after those EBS volumes get created, because the framework is so tightly coupled with the application, it can give the application a kick and say: by the way, you have three one-terabyte volumes that have just been connected to your EC2 instances, go attach them to your storage pool. And thankfully, it actually did work. We have a whole bunch of extra capacity. The threshold hasn't changed, because it's just an arbitrary "create more volumes when you're getting in trouble" number, and the amount of used data has now dropped to 14%.

So we simulated an "oh my gosh, we're not going to make it" scenario, where we're starting to approach full. I don't know if there are any backup administrators or SAN people here who deal with SAN stuff, but when you start to approach full, and especially if you hit full, the world comes crashing down. And the idea of this talk is: you have the ability to monitor your application, you take all this stuff up with your container scheduler, and you move it into the cloud.
You now have the ability to change the environment underneath you and effectively fix your application's problems. And I guess we'll go to questions there.

So if Mesos has this capability, what's the use of ScaleIO? Any node can provision this new storage if it's using Mesos, right? Right, but the framework installs and configures the storage platform on your Mesos nodes. If you want persistent storage for an application, yes. The original question was: which one is actually using the AWS API? Yeah, so let's go back here. You mentioned Mesos was actually doing that AWS API call; it's this Mesos framework. Effectively, think of a Mesos framework as a specialized plugin for Mesos, a plugin you create that has special knowledge about your application, and in this case that's ScaleIO. Does that make sense?

Okay, so Mesos is using ScaleIO, which internally is a framework, so Mesos isn't using it directly? Well, it's using it in the sense that if you start a workload, if you start a container and you want to provision an external volume, Mesos under the covers will call REX-Ray. I know there's a bunch of stuff in between; maybe we can talk about it offline, but Mesos will start the container, make the hooks to provision the ScaleIO volume, and attach it to that container.

You pointed to a lot of GitHub pages; do we have to pull down these different components, or does it all come with Mesos? No, it all comes with the ScaleIO framework. You deploy the ScaleIO framework to Mesos, and that thing installs and configures ScaleIO and all those other glue applications in your cluster. Effectively, I curled that command that says launch this framework, and it took care of all that under the covers. You didn't even have to be aware that REX-Ray was getting installed and all that stuff. I really only had the UI up there so that I knew everything was okay, and so that when we faked dumping a bunch of data into the ScaleIO storage pool, we could visually see it filling up, and then see that we had enough capacity once the volumes were provisioned.

When I start fresh, I have to install and configure this stuff, so I'd have to install Mesos, and then ScaleIO? Yeah, so Mesos, just like Kubernetes and just like Docker Swarm, is a container scheduler, so you need Apache Mesos, this particular container scheduler, to use this framework. There's a question in the back.

Yeah, I just want to make sure I understand this. We're using Ceph right now, and we have 300, 400, 500 Mesos slaves, every one of those slaves running containers, and on every one of them there are mounts from Ceph to make volumes available to Docker containers, so that when they spin up, if they need that storage, they'll mount that storage in the container. So with this, if I'm understanding it correctly, the volumes that are available through ScaleIO will be visible on these slaves without even looking at the containers? Visible on the slaves, in other words, if you do a df on a Mesos slave, you'll see the volumes? Yes, but only the mounts that are in use, in the case of Docker or Mesos.
So effectively, the whole idea of containerization is that when you have a particular workload and you have a volume for that workload, only that container can see that particular volume. So you have a security layer? Oh yes, most definitely. There has to be some token or something that authorizes it to actually use the volume. What's the underlying technology for that? I can't remember the name of it; it's the Mesos containerizer, which Docker also uses. There's security built in such that if I have a particular container, like a Postgres, and you have another container that comes up, like Elasticsearch, they can't go mucking with each other's data; it's completely isolated. So when you provision a volume for your workload, say a ScaleIO volume for your Postgres database, only that particular Postgres instance has access to that volume and can write data to it. Okay, great.

So can you have shared volumes in this or not? Because we find it depends; some of the file systems allow shared volumes, and others have to be exclusive. Yeah, two caveats. One, it depends on the application: if you're going to share a volume, you'd better make sure those two particular applications can work well together, that they can read and write to that volume safely. And two, the underlying storage platform also has to support the ability to share volumes. ScaleIO does, but like I said, it depends on your underlying application. As an example, in VMware, because I know VMware, if you need to create a datastore, you create one datastore and you have X number of ESX hosts that share it; that's how vMotion works. But that underlying datastore, your VMFS, is smart enough to know that the volume can be shared.

Any other questions? Oh, I actually used the entire time. Thank you guys for showing up; I know you missed beer for me, so I do appreciate it. Technically, there's no reason why you couldn't do this with other platforms. ScaleIO is free; you only pay if you want support. And if you wanted to do something like this for NFS, you could do it. I wouldn't do it, but I'm just saying you could.

Okay, we're going to get started. Our next talk is Minio, an open-source alternative to AWS S3. Our speaker is A.B. Periasamy. Minio is an Amazon S3-compatible object storage server built for cloud-native applications.
It has advanced features like erasure code, bit-rot protection, and lambda functions. Application developers often deploy Minio in a Docker container and orchestrate it with Kubernetes or Mesos. So without further ado, I'll give you A.B.

The most confusing part of Minio is the pronunciation: Minio or Min-I-O, how do I pronounce it? I call it Minio. The name came from "minimalist object storage." I find most of the community calling it Min-I-O, and that sounds fine to me too: minimal I/O, it's still storage. As long as you recognize the brand, it doesn't matter nowadays.

My background: I came from GlusterFS, a distributed file system that's now part of Red Hat. Back when I did Gluster, I was not a storage guy; even now, when people see me as a storage guy, it feels almost like an insult. I'm not a storage guy, I'm just a software developer who happened to be in the storage space. Last time around I wanted a job where I could hack on free software, and no one would give me one, so a bunch of my friends and I created GlusterFS. This time around, toward the end of GlusterFS, the user base had changed a lot. When I started Gluster, we built the file system for a few users who had that requirement, and when we went outside that user base, they said our data is enterprise data and this is not our use case. When I asked what enterprise data meant, they said SQL Server, Exchange, Oracle Database. That was back around 2006, the Gluster time. By the time Gluster became part of Red Hat in 2011, Amazon S3 had become very popular, the applications built around Amazon S3 were quite different, and the scale was even larger than Gluster. So Minio made me rethink those fundamentals.

Why did I pick the name Minio? That was the domain name I got. I looked around for a lot of names; I wanted the shortest one, something easy to remember, and it's about minimalism, which is the design ideology behind the project. Why that word? It represents something minimal, a beautiful piece of artwork, and all the other mammals were taken. So this is what I got.

The design philosophy last time, when I did Gluster, was influenced by Emacs; I'm an Emacs guy. Much of the design of Gluster came from an operating system most people haven't heard of, called Hurd, H-U-R-D. It's a brilliant operating system, designed ahead of its time, that did not take off for various reasons; it was the operating system kernel the Free Software Foundation was supposed to deliver. Gluster was very much the Hurd design brought to a file system. But this time around, the reality is different. Gluster was a user-space file system; it borrowed the ideas of a user-space operating system design, and early on no one believed you could write a file system in user space. Linus even said user-space file systems are toys. And it was in C. But look around today: file systems and storage systems are written in languages like Python and Java. The argument of user space versus kernel doesn't exist anymore. And now, the way I look at Minio, it's actually simpler than Redis, MySQL, MongoDB, anything else. If you look at any of these building blocks application developers are using, the developers have no idea about storage. In fact, today the average Node.js, Ruby, or Python developer can set up Redis and start using it; they have no issues.
But most of these folks nowadays, when I talk to them, they're like: what is the mount command? What is fstab? Why should I be root? And this is the user base today adopting cloud-native architecture and building really powerful, cool applications, but they don't understand infrastructure much. Luckily, Amazon got rid of the POSIX legacy requirements and convinced the application ecosystem that you no longer need them. And once you eliminate the POSIX requirement, it's no longer hard to build storage. When I started Gluster, they told me it was going to take 10 years just to get POSIX right, and by then the other file systems already out there would be so far ahead of Gluster that there was no point in even starting. Today, when I'm starting, there is no POSIX. In reality, an object storage is simply a GET/PUT HTTP server, simply a blob server. How hard can it be? If you look at a language like Go, and in fact Minio is written in Go, the hardest part of writing a web server is the web server itself, and that's part of these modern programming language frameworks. I originally wrote the prototype in C++, V8, and JavaScript, and then, as soon as we got confidence in Go, we redid it; the very first release was in Go, and there was no turning back. The real reason for Go is that it's a practical language. It's almost boring, but the community around it is amazing. Go has a very good, powerful web server, and a web server with a REST API that's compatible with the Amazon S3 APIs is pretty much all you need to build an object storage. There's no complexity there. There are parts of it, like erasure coding and the bit-rot hash and checksum calculations, and even those are not very hard: we hand-optimized them in assembly using the CPU instructions, yet they're just functions; once you write them, they just work. The entire object storage in Minio is a 15-megabyte binary. You just download it and run it, from a Raspberry Pi to high-end Linux machines, even natively on Windows and NTFS. It pretty much runs everywhere Go runs; Go is widely portable, and the parts we wrote in assembly we made sure are equally portable. They take advantage of commodity CPU instructions like SSE, AVX, and AVX2, and even on ARM we take advantage of NEON instructions. Yet from an end-user point of view, you don't need to know any of this. You don't know what the mount command is, you are not root, it doesn't matter: you should be able to deploy storage.

Why is simplicity such an important requirement for me? Because if you want to build a very large system, a very large infrastructure, it is simplicity you should measure; that's the real measure of scalability. When I did Gluster, what I realized was that Gluster can easily handle a petabyte namespace; most of the high-end deployments are actually multi-petabyte in a single namespace. And as I started pushing the systems hard, I started seeing the emerging newer applications, like Pandora and others. What they did was deploy many Gluster clusters in smaller namespaces. The reason was not that Gluster did not scale; it was for operational reasons. They deployed many smaller Gluster deployments at larger scale, and it kept scaling. And the detail there is that to scale, you keep things simpler.
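Going back to the "an object store is simply a GET/PUT HTTP server" point: here is a toy blob server in Go that illustrates only that point. It is a sketch, with none of Minio's erasure coding, bit-rot protection, or S3 API compatibility:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"
)

// dataDir is where blobs are stored, one file per object key.
const dataDir = "./data"

func handleObject(w http.ResponseWriter, r *http.Request) {
	// Use the URL path as the object key, e.g. PUT /mybucket/hello.txt.
	// (A toy: no auth, no path-traversal hardening, no metadata.)
	key := filepath.Join(dataDir, filepath.Clean(r.URL.Path))

	switch r.Method {
	case http.MethodPut:
		if err := os.MkdirAll(filepath.Dir(key), 0o755); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		f, err := os.Create(key)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer f.Close()
		if _, err := io.Copy(f, r.Body); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	case http.MethodGet:
		http.ServeFile(w, r, key)
	default:
		http.Error(w, "only GET and PUT", http.StatusMethodNotAllowed)
	}
}

func main() {
	http.HandleFunc("/", handleObject)
	log.Println("toy blob server listening on :9000")
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```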
Say you start with a large distributed system and you keep adding nodes: when you add your thousandth user, your millionth user, your thousandth node, you will be sweating. It's a really hard problem, because the algorithms chosen for a smaller-scale system, which is the most popular, most common deployment, around 16, 32, 64 nodes, break when you start scaling to a large number of nodes. And anything optimized for a very large number of nodes doesn't run well at small scale. Any very large system pretty much behaves like this: you have lots of eggs and you put them in a basket, and as you build a bigger and bigger basket, it becomes unstable. It's very complicated to build a giant basket. That's the problem of building distributed systems, and it's not just true in storage: in any distributed system, if you put everything in a single namespace, it starts facing problems. It becomes complex, and when it becomes complex, even though it functions, it's no longer scalable. That was the fundamental notion we had to get rid of when we built Minio this time. It has to be ridiculously simple, to the extent that a Node.js developer could use it; then you can scale it to very large deployments. We'll go into the details of how exactly we do that, but that's the design ideology behind Minio. And also the idea of doing one thing right, with only one way to do it, rather than doing ten things. We don't want to compete on features. It's very tempting for file system designers to keep adding features: we can do NFS, we can do HDFS, we can do block storage. Gluster actually has a stackable, pluggable, modular design, and you can write lots of plugins as translators. Gluster's strength, in the early days, was that someone called it a Swiss Army knife; they called it that proudly, and I took it as a compliment too. But after a lot of experience, I realized it wasn't a compliment. A Swiss Army knife is not the best tool for any job; it has a lot of capabilities, but it's mediocre at each one of them. The real way to do it is to build an object storage that's very good at object storage. Don't do NFS, don't do multi-protocol, no grand unified theory of storage. Just do object storage.

Now, luckily, the industry also changed in the last few years. The notion of primary storage and secondary storage, that is, primary storage as SAN and NAS and then secondary storage as the slower devices, is going away. Those are good if you are running SQL Server, Exchange, WebLogic, those types of applications. In the modern cloud world, if you look at the application stack, all the operational data is going to various kinds of operational data stores and key-value stores, all the way from Redis, MongoDB, MySQL, sharded or not, up to Cassandra on the higher end. The operational data, the heavily mutable data, sits there. The mutable data is the data you're writing in, say, a ticketing system: all the unsold seats are active, and it needs to be in memory, fast, and coherent. That all sits in databases, could be Redis, MongoDB, any one of them. The operational data stores have moved away from SAN. Early on, when you built an infrastructure, if you bought Exchange or Oracle Database, you almost always ended up buying a SAN or NAS.
Today, if you are deploying Redis, MongoDB, MySQL, anything, any of these persistent containers or persistent services, you are using local drives. Local drives could be solid state, NVRAM, or just SATA hard drives; you are not even looking at a SAN or NAS. These systems are designed to just go distributed, use local drives, and be done with it. And that is the new primary storage, meaning cache is the new primary storage. All the changing data is there. But once data needs to persist long term, it actually becomes immutable, or for the most part, it is immutable. And all of those applications are beginning to converge on object storage. HDFS had a good success. Compare it with Gluster: Gluster had a good success as a file system that looked like NetApp, looked like NFS, something familiar, mostly storing files, from archive data to music to all kinds of data. And then HDFS had its success in storing continuous streams of logs that did not fit in a database. Mostly temporal data, transaction data, logs; HDFS was a good choice. But as HDFS grew and became complex, nowadays what's happening is all these users are beginning to move to object storage. Luckily, the legacy systems are going away and there is a convergence towards object storage. And if that's the case, then Minio need not support file protocols; the legacy protocols like the POSIX APIs are all gone. Minio is an object storage server, and it's only an object storage server. What is it good for? For storing mostly immutable unstructured data. Even structured data at very large scale becomes unstructured data, like the logs that you typically stored in HDFS. Now, what does Minio consist of as a project? Minio is a complete stack. It's a full blown alternative to Amazon S3. There is a server, there is an SDK, and there is a client. Minio is compatible with the Amazon S3 APIs. It's completely compatible. If you wrote applications for Minio, say using the Minio SDK, and you point them to Amazon S3, they run fine. When we wrote the SDK, initially we did not even start with the server. We started with the client and SDK to make sure that they were written for Amazon S3. Until recently, the examples had Amazon as the default example. And they run fine for Amazon. Now they should run fine for the Minio server. It's important for us to retain compatibility with Amazon. Sometimes community guys ask: the limits that Amazon enforces do not make sense, can you relax that? And we defend that very hard. We tell them no. It would be good for Minio if I relaxed it, because then you are stuck with Minio. You can't go back. You start building applications that, over a period of time, have dependencies on Minio. These are minute dependencies. Then we would be doing the evil thing Microsoft did when its JVM started adding extensions. It's a bad practice. And we would just defend hard and say, no, we are not doing it. It was important for us to be just Amazon S3 compatible. There were even hard questions from the community. Why not support the Swift API? We are all open source; in fact, I would like to call it free software. And these guys are like, why can't you support the OpenStack Swift API? Why are you supporting something proprietary, that is, Amazon S3? My answer was: pick one. I don't want to support multiple different types of object storage APIs. It causes a lot of confusion. Then the documentation will show preference towards one type of API and the other one would be poorly documented. You just don't want confusion.
And unfortunately, the standards are dictated by the winner. Any time someone creates a consortium to come up with a standard, it results in something like USB, where until the previous generation you couldn't even tell which way to plug it in. It's just hard. Amazon did the right thing by simplifying it. They actually did a really good job with the REST API. In fact, I found the Microsoft Blob API to be a more logical, more simplified, common sense API compared to the Amazon S3 API. But I understand Amazon was bullied by the users to just keep adding extensions here and there. Overall, though, the Amazon REST API is well thought out. It's a good set of APIs. So instead of creating a new API or supporting something like the Swift API, we just picked one API and stuck with it. And it is pretty much all you need to build a very large scale, multi-petabyte storage infrastructure. Could be a data lake, could be application storage, could be a container image repository. This is pretty much what you need. Now, the reason why we wrote the Minio SDK and the client. First the SDK part. The Amazon SDK actually runs just fine against the Minio server. If you are already an Amazon user, or say you are using other services of Amazon, I would say just stick with the Amazon SDK. But for the most part, I find our users are just using storage, just Minio, and the other components are other open source building blocks. And they find the Amazon SDK to be too bloated. They also complain that the Amazon APIs at the SDK level are not written in an idiomatic way, let's put it that way. Say the Golang SDK has pointers unnecessarily, and they complain that it is not clean, it's ugly. And some of them are auto-generated. For us, we wanted APIs that are clean and simple and easy to understand, and a lightweight library. In the Minio SDK, if you look at it, say there is a put object API, and you give it a stream, a small stream, a large stream; you don't need to know how, say, a 5 GB object or a 10 GB object is exactly uploaded. The Amazon SDKs would expose that: anything larger than 5 GB has to be multi-part, and in fact, anything larger than about 5 MB should switch to multi-part. And the multi-part API is very complicated. Then they tell you, why not use an upload manager API? The Minio idea is that you should have a put object API, you give it a stream, and then it is the job of the SDK to determine whether it should use multi-part or not, and how it chunks it up. If there is a partial upload that already happened, continue from where it left off. Those details should never be exposed. From an application point of view, all you need is to init a new client, like Minio.new, and then you say .put, .get, .remove. That's how it should be, pretty close to a shell command, right? That's what the Minio SDK is. But otherwise, it's designed to be compatible with Amazon first, even more than with Minio, and that's by design philosophy. Now the CLI, the Minio client, why did we do that? Say you are an app developer and you built an application. The moment you start aggregating around 200 terabytes, that is the time you actually bring in someone full-time as DevOps to manage the infrastructure. And at that point, if you tell them, here is the AWS SDK or the Minio SDK, you have to start writing tools to manage the infrastructure, that is very hard. There are a lot of details to be understood. What they would like is tools like rsync, ls, cp, like the coreutils.
And the coreutils that Unix has had for decades don't even handle a few terabytes well. Say you have a 10 terabyte drive and try to copy it. You start a copy, come back, and it has died somewhere on some Unicode character or something like that. cp, when it dies, cannot restart from where it left off. It won't show a progress bar. But yet cp has so many options that no one knows why they exist. The coreutils did not make sense. So we wanted a coreutils replacement. And the goal is not POSIX legacy compatibility, but to write a coreutils alternative, just enough tools like ls, cp, cat, pipe, rsync, that work for a file system. You can actually use it for your NetApp array, any file system array, any local file system, or you can use it against an object storage. Building a mountable file system over object storage doesn't make sense. Instead, have the tools. The whole purpose of writing a file system client on top of object storage would be to support these legacy tools. But those legacy tools don't do a good job. Simply replace the legacy tools, because applications are anyway doing high level get and put APIs. When it comes to these tools, don't go through POSIX. POSIX is a backward step, right? So the Minio client is simply a replacement for ls and cp like tools. They are designed to be scriptable. They produce fancy output. I'll show you a quick demo of it. I think we'll have time; there is not much on the deck itself, so I think we'll have plenty of time. So that's what the Minio client is. It provides you simple coreutils so a DevOps person could manage the infrastructure. So you can actually have a Minio server in sync with Amazon or even Google Cloud. Say you don't trust Minio and you want to keep all the data on Minio in sync with Amazon S3. Or you are on Amazon S3 and you are worried about an outage. Some people say it never happens, 99.999 percent. Well, it happened just last week. So you have the mc mirror command, which is like rsync. You can have a copy of all your data on Amazon mirrored on Minio, say Minio running on DigitalOcean, Minio running on Google Cloud. It would keep both of them in sync. Sometimes you would use that tool to just keep two Minios running in two different data centers across the globe, and mc mirror would keep them in sync. So it provides all these useful tools. I'll do a quick demo of it later. So now the Minio server. The Minio server itself is a full blown server, like I said before. It has a few different modes of operation. The most common one is, say I have my laptop and I have my drive and my documents directory, my downloads directory, and I run Minio on the directory. You simply say minio server and that directory. There is no installation required with Minio. It's just a static binary. That's standard practice for Go-based projects. Just download a static binary. In fact, I thought mostly people would download the Minio binary. The community kept asking, why don't you give me a Docker image? They no longer ask about RPM and deb packages, but they ask for a Docker image. They did ask for a brew package, by the way. So we don't have RPM or deb, but we do have brew: brew install minio. But when they asked me, why don't you give me a Docker package, I'm like, why would you put a 10, 15 megabyte binary in a bigger Docker image? And I kept saying no. At some point, the community started maintaining this on their own. And then we said, okay, if it is useful, then we will do an official release.
And there were 500,000 plus Docker pulls within just the last three, four months. And when I compare that to our actual download count, the direct downloads are a tiny fraction, so I actually changed our default download instructions, everything, to Docker. When I asked them, why are you doing this, they tell me that is how they deploy Redis, Rails, any other application building block; they see Docker like an apt-get or yum. I understand that, right? It's a packaging and delivery mechanism. So we have no problem with that. So we started tightly integrating Minio with Docker, Mesos, Kubernetes, and a bunch of other stuff. Now, the simplest mode of running Minio is Minio on top of a file system. It could be a drive formatted with ext4 or XFS; my default choice is usually XFS on Linux, but it runs even on VFAT, a variety of file systems. The previous demo you saw was running on top of a file system. Minio even supports NAS devices. So if you have an Isilon or a NetApp filer, VNX arrays, you can simply mount the NFS volume and run Minio on top. It turns your enterprise NAS array into an Amazon S3 replacement. If you have multiple NAS volumes, say an Isilon, which can actually support multiple mount points and much more parallel throughput, then Minio also supports a shared NAS backend. You can run multiple Minio servers on each one of those mount points, and it runs in a coherent, distributed way. Now, if you are building a really serious storage infrastructure to stand up against Amazon, buying a lot of Isilon would be very expensive, right? It wouldn't make sense. You can't really build a modern cloud storage infrastructure using legacy SAN and NAS arrays. So that's when you would buy a bunch of Dell JBODs, Supermicro JBODs; a JBOD is simply a server with lots of drives. Don't go buy RAID controllers. They don't work anymore. For the scale we are talking about, simply the rebuild time is unbearable, right? It can take more than a week. So in Minio, we brought in a high performance erasure code, and the erasure coding is actually hand optimized in Plan 9 assembly language. Why Plan 9? Because the Golang developers came from the Plan 9 operating system, and Go's internal assembly format is the Plan 9 assembly format. All the crucial parts of the Minio code base that require high performance are optimized like that. Now, if you run Minio on a bunch of drives, no RAID controller, Minio knows how to combine them. Think like ZFS. Minio is like the ZFS of object storage, except it's also a network storage. So you've got a JBOD, it has 12 drives, you start Minio on it, and Minio will combine those drives and present them as if they are a single large object storage. And with the erasure code, which I'm going to explain here, objects are striped across the drives. You can lose half the number of drives, and you still would not lose any data. The reason for us to do this is you don't want to be running to the data center in the middle of the night to replace drives. Most often it's like, oops, I pulled the wrong drive. Or when I tried to pull the drive, the power cable got loose and came off, the Ethernet cable came off. All these problems are actually very expensive. The most expensive part of an operation is actually the human cost. So you should eliminate operations as much as possible. Minio's point is that you deploy it and you forget it; you only go to the data center to decommission racks and racks of machines, not to fix drives and motherboards anymore.
So Minio supports all of these: if it's a box with just a RAID volume or a NAS or a single drive, it turns that file system into an object storage. And in that case, it pretty much runs like an FTP server or an NFS server. You upload objects to it, and the objects that are uploaded through Minio look exactly like files and folders on the back end. You turn off Minio, you look at the back end, it is exactly the same. In fact, you can run Minio on top of your existing file system with content, and it would just appear as an object storage. In erasure code mode, Minio writes the objects striped with erasure code across drives. In this mode it's a single box with multiple drives, and the whole box can still go down. Now, say I want high availability too. One simple technique you could do is just have two boxes like this and run mc mirror; mc mirror pairs them and, as and when objects come in, it keeps the other server in sync. But you say, I don't want replication, I want even higher availability, because replication is not sufficient. Often the problem with replication is you need at least three copies to establish quorum guarantees. The erasure code that Minio supports is also a distributed erasure code. You can actually run Minio across multiple nodes; you type the same exact command and they all cluster themselves up. And now half the machines in that cluster can die and you would not lose any data, and the servers are still available. If you want to build a very large system, don't keep on adding machines, even in distributed mode or standalone mode. What you would do is deploy many Minio instances. You deploy one Minio per tenant. Say you want to build a Dropbox-like application. Don't build a very large, say 10 petabyte, namespace and put all the users in as separate directories and create lots of users in Minio. That's not a good way to scale. A better way to scale is that Minio gets deployed one per user, one per department, one per region, one per corporation; the tenant is a loose terminology. It's an application instance. You scale by deploying many, many Minio instances at large scale. Often users deploy this through Mesos, Kubernetes, Docker Swarm, one of these orchestration systems. That is how you deploy the rest of your application building blocks, and Minio seamlessly scales along with that. The DC/OS Universe package is there for Mesos users. And if you are looking at Kubernetes, there is a Helm package and you would just do it that way. This is pretty much how Minio is deployed nowadays. Minio just runs alongside the other application building blocks as containers. But it is just a static binary; it runs on anything. Now let's go back to the demo. There is no installation, like I told you before. Just download the Minio server. Let me show you the examples. The simplest case is minio server and some directory. And it runs like an FTP server. A more sophisticated use is this. If you've got a JBOD, say in this case there are 12 drives, a 2U chassis typically. Nowadays they even come in a 1U form factor, which is sleek; Quanta, Supermicro, all of them have them. And again, don't buy a RAID controller. These boxes come with a SAS controller. The drives just appear as independent drives and you mount them as separate mount points. And when you say minio server and just give it one directory from each of those drives, Minio switches into erasure code mode automatically.
There is no other specific configuration. The moment Minio sees multiple directories as input, it knows it has to combine them together. That's when it uses erasure code. An even more sophisticated approach would be this. Say in this case there are four nodes, and you want to combine all those four nodes into one. In the simpler form, there are four drives, one from each of those servers. If you notice the pattern, this is exactly the same as the standalone mode, except you give an HTTP path to each drive. Now Minio switches into a distributed erasure code mode in this context. At large scale deployments, these commands are automatically derived and chosen. So when you deploy through Mesos or Kubernetes, they know which hosts and which drives are the right ones. In some sense, call it cloud native, meaning: if it is Gluster or anything else out there, the other distributed file systems, they manage the namespace, they manage the resource management, resource provisioning, and orchestration. When you add users, when you add drives and nodes, you tell Gluster about it. And then Gluster does the provisioning, not Mesos. Now they kind of conflict with each other. Whereas in Minio, when you add drives, when you add nodes, you always tell Mesos, Kubernetes, and the others. And when applications are provisioned, Minio is provisioned just like an application. Kubernetes is in a better position to do resource management and orchestration, and Minio should not come in the way. So it is deliberately designed to work in that mode. Now let me show you a simple case. All I did was minio server and dot. Just ignore that message; it just says there is a new update available, my version is already three days old. Minio is also released in a way that every important fix we make, we don't hold back. Because nowadays, see, here is a 3.5 that is very stable and you are stuck with it. Then 4.1 came, 4.2. You are like, everything is running fine, I don't want to touch it. Now 5.2 came. Are you going to plan a major downtime and an upgrade and migration? Those days are gone. There is no weekend and nighttime anymore for SaaS applications. And it's important for us that our users also continuously keep themselves up to date. Every important fix we make, we tag as experimental or stable, and we keep pushing these important updates out as soon as possible. Don't hold them back. That way, any failure, you quickly catch it, and the failures are small. Failures are inevitable, right? But as long as they are small, you can handle them. If they are big, then everything blows up, right? Now, when you start it, all I did was minio server dot, and then it started. It's already running. This is the standalone, simpler mode. In fact, this mode is already useful. Say when applications are deployed on DigitalOcean, for example, the DigitalOcean block drive, which is the hard drive that's attached to the container, is already backed up and reliable. And you are like, I don't want erasure code or anything. This is how you would do it. Now, as your application scales, you spin up new containers for new sets of users in your app. And it would spin up many new Minio instances, even within a single application context, because applications today are multi-tenant by default.
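To make the three modes easier to follow, here is a minimal sketch of the invocations being described; the paths and hostnames below are placeholders, not the exact ones from the demo.

```sh
# 1. Standalone: a single directory on an existing file system or NFS mount
minio server /data

# 2. Erasure code: one directory per local drive in a JBOD, no RAID controller;
#    multiple paths switch Minio into erasure code mode automatically
minio server /mnt/disk1 /mnt/disk2 /mnt/disk3 /mnt/disk4 \
             /mnt/disk5 /mnt/disk6 /mnt/disk7 /mnt/disk8

# 3. Distributed erasure code: the same command, typed on every node,
#    with HTTP paths pointing at each node's drive
minio server http://node1/mnt/disk1 http://node2/mnt/disk1 \
             http://node3/mnt/disk1 http://node4/mnt/disk1
```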
And when you spin up new Minio servers, each Minio server runs on a different port, on the same machine or a different machine, and all they need is their own separate directories. And you keep scaling; you deploy many Minio servers even within a single application. When you start Minio, it says here is your IP address, here is your access key, secret key, and region. You are good to go. It's already fully running. If you point your browser at it, and my resolution is sort of messed up, but you can see Minio even has an embedded web console, like the Amazon web console. Everything you need is just part of the Minio server itself. You can access this from the browser or from the command line. So it shows all the same directories that you saw there. And the reason ls works here is mc ls. So pretty much on my laptop: cp, ls, everything; see, I'm doing ls. Now, I want it to be scriptable, so it produces JSON output that is scriptable. mc has all these commands, like ls, cat, pipe, mkdir, share. share actually creates a unique URL that is self-expiring, so you can share that URL with someone without giving them access. It gives them temporary access and the URL expires. You can tell the share command to expire it within seven days, or even just two hours or four minutes, things like that. Then rm and mirror; mirror is the rsync-like command. There is even the mc pipe command: say you could pipe a mysqldump straight into an object, and the object storage could be running on Amazon S3 or it could be Minio in a remote data center; you would just pipe the output directly. Or you can just do a regular mc cp, like cp. That's what it does. The same command, if you want it scriptable and you don't want a fancy progress bar, produces JSON output. The idea behind the Unix tools: say, ls shows the size of the file as a raw number that I have to take a calculator to read. If the console supports color, if you're printing for human output, show me this many megabytes, this many kilobytes, right? Why show me a number that I have to calculate? They chose that because they wanted it to be scriptable. But these tools produce output that is inconsistent, not defined by POSIX, and it is pointless, right? So Minio's point is: if the terminal supports color, enable color by default. If you're printing for humans, show them human readable output. If you are scripting, pass the --json flag. That makes more sense than the Unix way of doing things. This philosophy came from the GNU coding standards: if you can do it better, do it, even if it means breaking the old ways of doing things. So that's pretty much what Minio is. The rest of it I'm just going to browse through. Minio supports Lambda-style compute APIs; NATS, Kafka, AMQP; we can actually send a message as and when objects come in. You can even do mc watch; when I showed you the command line, mc watch can actually behave like file system inotify on your command line. So you can even write shell scripts, Lambda jobs as shell scripts. Or you can just store the events as and when objects come in or get deleted. Minio can store those events in a database. So you can query: show me all the objects that were created, or show me all PDF documents that were created within this time frame. You do a SQL query into Postgres. Things like that you can do.
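As a rough sketch of the mc workflow being demoed, with placeholder alias names, keys, and bucket names rather than the ones on screen:

```sh
# Register the server under an alias (endpoint and keys are placeholders)
mc config host add myminio http://localhost:9000 ACCESSKEY SECRETKEY

mc ls myminio/photos                  # human-readable, colored listing
mc ls --json myminio/photos           # the same listing as scriptable JSON
mc cp backup.tar.gz myminio/backups/  # plain copy, with a progress bar
mc share download --expire=48h myminio/photos/foo.jpg   # self-expiring URL
mc watch myminio/photos               # inotify-like stream of object events
mc mirror myminio/photos s3/photos    # rsync-like sync toward Amazon S3
```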
And it supports full blown Lambda compute APIs and is compatible with the Amazon Lambda functionality. Minio is popular. It's mostly deployed in the US, Europe, and Japan. And it's already the fastest growing open source object storage. It's getting deployed in public cloud, private cloud, all kinds of environments. That is it. Thank you. Do we have any questions? So let's say I have a very large existing NetApp infrastructure. All file systems presented via NFS. I could put a Minio server on a box that has everything mounted and present that internally. But what if I was very worried that that particular Minio server could go down? Can I have multiple Minio servers, one per node, and load balancers behind a VIP? OK, so to repeat the question for the online viewers: does Minio run on top of a NAS filer in a distributed mode? Minio runs on top of a NAS filer when you mount an NFS volume and run on top of it. It turns that NFS mount point, a NAS filer or NFS server, into Amazon S3 style object storage. Yes, that's understood. The question is, now, what if I have multiple mount points, particularly a scale-out NAS like Isilon; you can have many mount points where each mount point points to a different NFS IP, so you could actually exploit its parallelism, its distributed nature. Can Minio take advantage of this setup? Yes, you run many Minio servers, one on each of those mount points. Now, the other part of the question is, when you do this, now you have many Minio servers, can I put a load balancer in front of it? Yes. That's actually one of the deployments in a medical environment. These are medical archives, PACS data, non-DICOM images. The medical images are stored in this repository. And they actually use Isilon and EMC VNX arrays, because their data protection is trusted by these hospitals. They have been around for very, very long, and there are so many people to support them. So that's the reason for EMC. Now, Minio runs on top, but then how about a load balancer? In this case, they use an F5 load balancer. But the most common deployment in Minio's case is an NGINX load balancer. I also see nowadays Caddy and Traefik, t-r-a-e-f-i-k; there are so many. Standard HTTP load balancers should just work fine. Yes. Kind of going on with that. What about putting something like Varnish in front of it for caching? Or does it already do some caching internally? So I have not tested Varnish, but I have seen users using the other load balancers, like NGINX, and using their caching abilities. Minio early on had full-blown caching and traffic shaping, all those functionalities. The feature set that you see in Minio now is actually much smaller than all the code we have thrown out. We had caching and we removed it. There is still some amount of caching in the erasure code part: for the same object frequently read, you don't want to be taxing the CPU repeatedly, so we cache that. But generally, we don't want to come in the way of the kernel cache or the load balancer cache. So Minio itself does not do caching. But a caching proxy in front should work. I heard from one of our engineers that Varnish works, but I have not seen it. It should actually just work fine. There is also a new project, a new capability we added to Minio: Minio as a gateway, meaning the Minio server itself can act as a caching proxy. This is very early at this stage. All we have is, say, a user storing data on Microsoft Azure Blob storage.
They wanted Minio to translate Azure into an Amazon S3 compatible object storage. So can I use Minio as an S3 gateway into Azure Blob? And we added that capability. The moment you add that capability, Minio becomes a proxy gateway. If it's a proxy gateway, then it makes sense to do a caching proxy, because proxy and caching always go together. So now we can justify adding more intelligent, smart caching. This is a work in progress. But I would actually recommend Varnish. Varnish should just work fine. If not, we can test and document it. It would be a better alternative. Hi. If I have multiple Minio servers deployed, and this is like a distributed setup, how do they communicate with each other, the different Minio servers? So they internally communicate through Go RPC. If you see this line, the way you would run Minio is, when you say minio server, you give HTTP paths to the drives on the other nodes. But you now have four nodes. You type the same exact command on all of those four nodes. And each of those nodes, because they see the paths, they know about each other and that they have to cluster themselves up. We had a lot of debate about whether we should support etcd, ZooKeeper type external locking coordination services. The problem I have is, you saw Minio, there is no installation required. It's just download and run. Setting up etcd or ZooKeeper and managing them is quite a skill. We did not want to complicate Minio or have complex dependencies like that. So we wrote dsync to eliminate that part. But also, if you notice, the way Minio scales is not one big gigantic basket, but many smaller baskets. So Minio's requirement is a locking synchronization service that's not dealing with thousands of nodes; it is dealing with 16 nodes. And that takes a completely different algorithm. You can do that with a much easier approach, by simply establishing quorum. So we wrote our own dsync for coordination and synchronization. dsync is like a Go mutex, except it allows you to lock coherently across nodes in a distributed mode. But the actual communication itself, as you saw with the HTTP paths, all the communication is on the standard HTTP port, the same HTTP port, the same HTTP connection. Over that HTTP connection, we use Go RPC to communicate. Did that answer the question? Thank you. Which was? Oh, OK. What about SSL support? It supports it. So with Minio, if you place the certificate in its config directory, Minio just serves HTTPS on the same port. Minio also handles this internally, because we don't want to create port 80, port 443 type two-different-ports confusion. The Minio server's router recognizes whether it is an HTTPS connection or regular HTTP over the same port, and it knows how to handle it. So you typically would put a proper certificate in its .minio directory, .minio/certs. If you put a self-signed, self-generated certificate, then for tools like mc there is an option to allow untrusted certificates, an insecure mode, so they would still establish the connection and work fine. We looked at, I'll leave that for later. Next question. I answered your question, right?
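A small sketch of the TLS setup just described, assuming the default .minio config location; the certificate and key file names follow Minio's documented layout, and the other paths are placeholders:

```sh
# Drop the certificate and key into Minio's certs directory;
# Minio then serves HTTPS on the same port it would otherwise use for HTTP.
mkdir -p ~/.minio/certs
cp server.crt ~/.minio/certs/public.crt
cp server.key ~/.minio/certs/private.key
minio server /data

# With a self-signed certificate, tell mc to accept it anyway
mc --insecure ls myminio
```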
In your previous slide, you had the distributed erasure coded mode. So what happens if one of the machines goes down? So, to repeat the question: in the distributed erasure code mode, what happens if one of the nodes goes down? Let's take the case where you have 16 nodes. Each of those nodes has the same access to all of those 16 drives. Now, any of those nodes you access, they all coherently access the drives. Now say one of the parity drives goes down. First, before I go into the detail: the erasure code computation, the calculation, is per object. So if you look at the back end drive, you would actually see the bucket name as a top level folder, and the object name path, like mydocuments/foo.jpeg. You would actually see the mydocuments/foo.jpeg path, but under that directory, you would see only a shard of the file. And there would be a JSON file that describes the metadata. Minio has no database dependency, nothing. There is no SQLite; any little metadata it has is a JSON file there. Now, every server has equally the same knowledge to reconstruct. Every time you do a get object, if all the data blocks are available, you don't need to do any erasure decode; it just gets the shards, concatenates them, and streams them back. But if any one of those data blocks is missing, you have to do an erasure decode. Now, as long as you have, say in this case there are 16, so there will be eight data blocks and eight parity blocks, as long as you have any eight available, in any order, so if you have 16 machines, as long as eight servers are online, it knows how to reconstruct the data. It uses a Reed-Solomon algorithm to reconstruct the data, and then it streams the data back. It's simply a global constant, and that is why it is n/2 parity. Again, like I described, the drives are cheaper than human errors. Right, so. Let me get to my question. If I have multiple nodes, for example, if I have four nodes, and I upload an object, how many copies do I have on disk? OK, so if you have four nodes and you upload a single object, how many copies do you have? In the four node distributed mode, Minio does not do replication. Replication has the problem that, at minimum, you need three copies to establish quorum guarantees. And replication is very complicated to do right. Even if you did it right, if one node goes down randomly, say you have nine nodes and you did three copies, any one node going down means a subset of the data becomes read-only. So replication simply does not do a good job at very large scale. That's why we switched to erasure code. Given that it's an object storage, it's a blob, immutable data, we can afford to do that, because object storage is hardly chatty. So now the copies, how much disk space does it consume? When you upload an object, there is no copy in erasure code. The object is just striped, sharded across that set of drives. So there is no copy. But how do you calculate it, if the question is how much disk space it consumes? In the case of replication, you would take 3x. So if you upload a 10 megabyte object, it would consume 30 megabytes of disk space. In the case of this erasure code, a 10 megabyte object consumes 20 megabytes of disk space. It is only two thirds of replication to start with. But here, half the number of drives and nodes can die, which means it offers you so much more reliability for even less space, so now you can build a very large infrastructure at cheaper cost. Technically speaking, with a 10 megabyte object, you could even do something like RAID 5 style parity, with only about 11 megabytes of disk space consumption. Or, like RAID 6, we could simply do 2 parity shards and about 12 megabytes of consumption. These options we did not even expose, because users would use them and they would lose data.
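As a back-of-the-envelope check of those numbers, assuming the default 16-drive layout split into 8 data and 8 parity shards (the single-parity and double-parity variants are the hypothetical alternatives he mentions, not exposed options):

```sh
# 10 MB object / 8 data shards           = 1.25 MB per shard
# 1.25 MB x (8 data + 8 parity) shards   = 20 MB on disk, a 2x overhead,
#                                           versus 30 MB for 3-way replication
# single parity (15 data + 1 parity):      10 MB x 16/15 is roughly 10.7 MB
# double parity (14 data + 2 parity):      10 MB x 16/14 is roughly 11.4 MB
echo $(( 10 * (8 + 8) / 8 ))   # prints 20 (MB consumed for a 10 MB object)
```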
So be opinionated. Many times they would argue so hard: I can do it, I can manage it. We tell them no. You can always change that global constant and recompile the code, but it's a bad practice, don't do it. So the disk space usage is 2x. I just had a follow up question so that I understand this correctly. If I wanted to add a node, since the configuration is static, do I need to add it across all servers? That's a good question. So how do you add a node? You don't. It is deliberate, by design. In Gluster, what we did was, when you add a node, you can add it to a running cluster, and it even has the existing config view and the new config view. All the clients are also synchronized, and all the IO follows two different paths. While we got it working and it does a good job, it's a nightmare. I know how scared I am of that code. Anything like that you should eliminate. In reality, in this context, it's not a requirement, because you don't really add nodes to a single tenant. Mostly what happens is, as your application grows, say you built a Dropbox-like application, new users, new departments are signing up. What happens is often not a single user growing; it is new tenants getting added. So the way you would build a Dropbox is not a monolithic Dropbox, but a Dropbox per tenant, and then you spin up new, separate Dropboxes. This way, when you have 1,000 tenants or 10 million tenants, they are just entirely separate setups. Say, for example, Minio running on DigitalOcean at large scale; we have no official partnership with DigitalOcean, though of course now they know there are a lot of Minio deployments running. But the idea is many different tenants running Minio completely in isolation. Now new users get added, their infrastructure spins up new resources, and these are separate, isolated systems. So in that context, I find it less useful to be able to add and remove within a tenant. It's about spinning up new resources per tenant. And why would you complicate it so much? Because that's a really hard problem. Any time you add, that means you're expecting me to do auto rebalancing in the middle of IO. It's not worth all that complexity. So eliminate that, and when you want to add resources, you always add a new setup, new setup, new setup. Even if you really want to migrate a user to a new setup, you would start a new Minio instance, then do mc cp or mc mirror, and decommission the old one. Because that is essentially what it would do internally anyway. And this way, it is a more controlled environment and much simpler, very predictable. That's what you need for scalability. OK, thank you so much. Thank you. Testing? Testing, can you hear me? OK, we're going to get started. This session is on ZFS, New Features and Replication. Our speaker is Dan Kimmel. And this talk gives an overview of the send-receive mechanisms in ZFS that allow simple, incremental replication to remote systems and describes new features that improve the workflow. So without further ado, I give you Dan. Hey guys, can you hear me OK? All right, cool. So before we actually jump in, I want to quickly introduce myself. I am an OpenZFS committer, and I'm also the file system manager at a company called Delphix. We use ZFS inside of our product to basically make cloning databases really simple and easy for our users. But today, I'm actually going to be talking about something totally different in ZFS, which is the ability to send a ZFS file system from one system to another. So before I start out, just raise your hands. Who has used ZFS before?
OK, all right, a lot of people, excellent. So has anybody used ZFS send-receive before? OK, cool. That's awesome. So I'm just going to quickly go through the history of ZFS. I'm assuming that most of you have used ZFS on Linux, but it's actually a much older project than when the Linux port of it arrived a couple of years ago. It was first developed, or first released, excuse me, as part of OpenSolaris in 2005. And it was actually originally ported to Linux shortly after that as a FUSE file system. That's now defunct. But more recently, we've gotten a native port on Linux. I should bring that in. OK, cool. So more recently, we got a native port to Linux that went GA in 2013. And as of 2016, it's now available in several open source distributions of Linux. And it's also being used as part of Docker and LXD to store the images for containers. So I'll quickly breeze through the ZFS description, because I know that you guys mostly have a good idea of what ZFS is. But basically, it's a local file system that does some logical volume management as well. That means that it can manage multiple disks and redundancy between them. As far as operations on file systems go, it will allow you to do snapshots of the file systems, clones of the file systems, compress the data block by block on disk, checksum the data end to end in the block pointers, so that as you read the data back out of the file system, you can detect whether or not it's been corrupted since you last touched it, and a ton of other things. I'm just going to talk mostly about snapshots and not talk about any other major features during this talk, because that's the biggest thing that comes into play with send and receive. So just to quickly give you an idea of what ZFS looks like on the command line, there's this command called zpool, which allows you to create a storage pool of multiple disks. And here I'm creating a mirror between two different disks, and I'm calling it tank. If I want to create a file system, that's the second line here, and I'm creating the file system with compression=lz4, so each block on that file system will be compressed using LZ4 from whatever the default size is. If it's 8K, then it will compress down to something like 2.5K on average for that file system, depending on what you put into it, of course. I'm going to write some data into it, and then what you can do in ZFS is basically take a copy-on-write snapshot of the data and give it a name. In this case, I said Monday. And then you can create a clone from that snapshot that is then writable, and you can keep track of the changes from the original data set from that snapshot point.
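A hedged sketch of the command-line walk-through just described; the device and dataset names are placeholders rather than the ones on the slide.

```sh
zpool create tank mirror /dev/sdb /dev/sdc   # mirrored pool named "tank"
zfs create -o compression=lz4 tank/myfs      # file system with per-block LZ4 compression
cp -r ~/some-data /tank/myfs/                # write some data into it
zfs snapshot tank/myfs@Monday                # copy-on-write snapshot
zfs clone tank/myfs@Monday tank/myfs-clone   # writable clone of that snapshot
```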
So really quickly, to give an idea of how the copy-on-write mechanism works, I'm going to show what happens at a high level inside of ZFS when a data file gets updated. So here we have "ZFS?" in the data blocks of our file system, which are at the lowest level here. And I'm going to replace the question mark with an exclamation point, because I'm really excited about ZFS, of course. And because ZFS stores the checksums for its data blocks in the pointers that point down the tree, the checksums for the data blocks in this diagram are actually in the indirect blocks, and the checksum for the indirect blocks is stored in the root block. So as soon as we make an update to one of the lowest level ones, we're going to have to update the next level up in order to write out the new checksum into that block, and also the next level up. And what that gives us is basically two different versions of the tree. Here you can see there's the old version of the data that's on the left. And you can think of that as a snapshot of the old version of the data that is not writable, because it's just a static point in time. But you can still use it to view all of the original data, including the question mark. And then you also have the current version of the data, which is writable. But again, it's going to just create a copy whenever we make a change to it, giving us constantly new points in time that we can look at the rest of the data from. So what exactly is the replication functionality that's built into ZFS? We call it send and receive. Basically, from a snapshot of the file system that you want to send to a different system, you can serialize that snapshot into a long stream using the command zfs send and the name of the snapshot. And when you want to take that stream and put it onto another system and recreate the file system, that works by using the command zfs receive, which takes the stream as input and writes that data down into the file system. So to give a concrete example using the CLI, here I am taking the same snapshot as before, tank/myfs@Monday. And then I'm sending that file system as a serialized stream that's just being written into a file. And then I can receive from that file-based stream to create a new file system. And down at the very bottom here, you can look through the list of file systems and see that now I have myfs as well as newfs, and both of them have snapshots that should be labeled Monday. But for some reason, they're labeled 6 p.m. My mistake there. But basically, the idea is we've created a completely new copy of this data, and we've just replicated it from one pool to the same pool. But it's actually possible to replicate it onto a different pool that has totally different storage semantics. So we created tank as a mirror of two different disks, but we could have created it as a RAID-Z, a more complicated configuration, or multiple mirrors, et cetera. So a common use case for this is actually going to be sending to a different system, where that system could be configured differently or have larger disks, et cetera. I'm going to use the piping notation from now on to do this operation in the slides. So as an example of what it would look like over the network, basically the simplest case is doing a send of that snapshot, piping it into SSH, and doing a receive on the remote system. That will then create the same exact structure that we saw on the remote system. And one of the cool features of ZFS send and receive is actually that it's really easy to do incremental sends. So once you've created more data on the system that you originally sent from and taken a new snapshot, you can actually create the incremental stream from @Monday to @Tuesday and just send that data over the wire to the new system. And as you can see at the bottom here, now we have two different snapshots on the remote system and all of the data from the original data set has been replicated there.
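A minimal sketch of the send and receive flows being shown, with placeholder host and dataset names:

```sh
# Serialize a snapshot into a file, then recreate it as a new file system
zfs send tank/myfs@Monday > /tmp/monday.stream
zfs receive tank/newfs < /tmp/monday.stream

# The same thing over the network, piped through ssh
zfs send tank/myfs@Monday | ssh remotehost zfs receive tank/myfs

# Incremental: send only the blocks that changed between @Monday and @Tuesday
zfs send -i tank/myfs@Monday tank/myfs@Tuesday | ssh remotehost zfs receive tank/myfs
```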
Yeah, go ahead. Yeah, so the question was basically: after you've already sent a second snapshot using an incremental send and receive, is it possible to delete the original snapshot on the receiving system? Yeah, you can definitely do that. As long as you don't ever need to reference that snapshot to do some other incremental send for some reason, you can delete it as soon as you want. Same thing on the sending side, too. So these snapshots are colloquially called the from snap and the to snap. So if I mention this later on in the presentation, you know what I'm talking about. Now, I think the best way to describe what I think is good about this system is to compare it to other tools. For instance, take a really simple example like rsync. rsync basically has two different sides, a server and a client, and they communicate back and forth, and they have to read through all of the data and generate checksums incrementally to see what data has changed and which files and which segments they'll need to send over the network. There are a few different problems with that approach to doing replication. First, it's communicating in both directions, from the client to the server and from the server to the client. And so whenever you're blocked waiting for a response from the other side, you can introduce latency into your replication. ZFS send and receive obviously doesn't do this. It doesn't even need to have another side to replicate to. You saw before that I just replicated into a file and that was perfectly adequate. That means that if you wanted to, you can even create this file ahead of time and then just push it over the network as fast as you can. Secondly, I mentioned with rsync the idea of having to read through all of the data in order to determine what has changed. If you have particularly large files and you've only updated a few bytes in one of them, then rsync will notice that the modification time of the file has changed, and then it will have to scan through the entire file generating checksums the whole time. When really, you just want it to recognize the tiny piece that is different than before and just send that, without having to scan through the whole rest of the file and do a ton of unnecessary IO. Also, rsync doesn't have any concept of where the blocks are on disk, so it doesn't really have the ability to do any kind of prefetching. It could request the blocks in another thread or something, but it's definitely a lot more complicated to write that. And finally, the last thing with rsync is that it's very difficult to make sure that all of the really complicated file system properties on the sending system are perfectly replicated on the receiving system. This includes file permissions, ACLs, user and group permissions, et cetera. ZFS, as I'll talk about in a moment, actually handles that in a really elegant way. So I'm going to dive into the last two of these points: basically, sending the minimal amount of data without needing to read through everything to see what's changed, and the completeness of the data sent. So first off, I'm going to give you the example of doing an incremental data update, where basically we're looking at snapshot 1, which is marked in all of the block pointers on the left branches of the tree. And then you can see that we have snapshot 2, which has the single update that we made earlier on in the presentation to add that exclamation point at the end.
So basically, when we're doing an incremental send, the way that ZFS works is it starts at the top of the tree, and it looks at the birth times of all of these different blocks underneath each level of the tree. And whenever there's something that it doesn't need to send, it doesn't even look through that part of the tree. So here you can see that it's going to take the path down the 2, but it's going to ignore that 1, because 1 has already been sent. You have told it that, based on the snapshot that you're sending from on the command line here. And then the same thing on this level. It's going to look at 1 and 2, ignore the 1, go down the 2, and it's just going to read that one tiny bit of the file that we changed, rather than going through the whole thing to checksum it and see what's different. I'm lying a little bit, or I guess what I'm saying is, I'm making this a little bit easier to understand by saying that the names of the snapshots are 1 and 2, but actually the way that ZFS does this is with an internal format called a transaction group number. So the last piece that I'm going to talk about of how this is already architected is how ZFS manages the sending of all of the extra attributes of files and the metadata about what's being stored. ZFS send operates just on DMU objects. And basically everything in ZFS is stored in the DMU as an object. And it doesn't need to know anything about what's stored inside of any of these objects in order to send it. So when we store things like directory structures and Windows users and NFSv4 ACLs, all that stuff is just stored into DMU objects. And ZFS send and receive doesn't need to know about any of it. It just looks in the object, grabs the data from it, and puts it into the stream. And so whenever you add a new POSIX layer feature into ZFS, you don't need to make any modifications to the replication mechanism in order to make that work. All right, so now on to the interesting new cool stuff. So I'm going to talk about a couple of new features in send and receive, the first of which is the ability to resume a replication that either fails or is interrupted. And the second one is looking at send streams and trying to make them smaller so that we can send them over low bandwidth networks. So first off, what exactly is the problem with resumable replication that we're trying to solve? This is an actual customer problem that we have had at Delphix, where the customer is trying to send an enormous amount of data over the network. And maybe once a week or so they would have a network outage, so the 10-day replication would die in the middle of it. And because ZFS previously didn't support any kind of resumable replication, all the progress that they had made would be lost and just destroyed on the receiving side. And basically they were just never able to complete this replication, which made them very sad. So you can imagine that the solution to this is basically to remember exactly where you were in the stream so that you can pick it up on the sending side and resume the send from there, which is basically what we did. So on the sending side, we've always sent things in increasing (DMU object number, offset) order. So basically, in that tree that we were looking at before, you can imagine that that was one DMU object, and the offset is going from left to right through the file. So basically we were always traversing the tree in this left to right ordering.
And so what that allowed us to do here was basically just remember the exact point where we were, and not do any kind of marking of data as we sent it in order to know that it had already been sent. So the way that we actually implemented this was, on the receiving side, as you're receiving information into the system, periodically write out the DMU object number and offset that you just received. And then after the receive fails, you can pull that information back out, copy it over to the sending side, and then restart the send from that point in the send stream that you were creating before. So what does this actually look like from the command line? So what's in the token? The token includes the DMU object number and offset like I was just talking about. And it also includes some information that helps you recreate the full state of the send alongside those two things, such as the from-snap snapshot name and the to-snap snapshot name. So how does it actually look? Let's say that you're in the middle of a send, minding your own business, and some lousy kid runs up and just cuts your cable with a pair of scissors; very bad for your replication to have scissors involved with your cords. So first, obviously, fix the cord. But then on the receiving side, you basically need to request this resuming token on the file system that you were creating for the receive. So this is just going to be some big hexadecimal string with a bunch of hyphens in it. But basically, it's an encoding of all that information that I showed on the previous slide that's totally opaque to the user. All you have to do is copy and paste it. You don't need to know anything about what's inside of it. And then on the sending side, you basically do zfs send -t, and you give it the token. And that will pick up from the point where you were before and start sending again. So this solves the case where we know where we died. But there's another case that this exposed that I'll talk about in a second here. For people who are listening closely, you may have the question: does this break our only-communicate-in-one-direction rule? And the answer is, yeah, it does. Because now we're copying this token from the receiving side back to the sending side. But the design of this is such that we don't think that this will happen that often. Maybe if you have a network outage once a day or something like that, which is a ridiculously bad network, then you would have to do this once per day. But it's not like every few milliseconds you need to compare a new checksum with the other side of the replication. So what exactly is that other problem that I mentioned? When we do a send-receive without resumable replication implemented, the last thing that we would send over would be a checksum of the entire stream that we had sent so far. And if that checksum didn't match what the receiving side had calculated, it would immediately say, oh no, something was wrong, we should throw all that data away, please just resend it to me. Obviously, that's not good if you've just spent 10 days sending all your data across the wire and just one little thing got corrupted halfway through. So the way that we tried to solve this was basically by adding a checksum into the stream at the end of each record that was being sent.
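A hedged sketch of the resume workflow being described, assuming placeholder dataset and host names; the interrupted receive has to have been started with the -s option for the token to exist on the receiving side.

```sh
# On the receiving side, read back the opaque resume token
zfs get -H -o value receive_resume_token tank/newfs

# Copy that token to the sending side and resume from where the stream stopped
zfs send -t <token-from-receiving-side> | ssh remotehost zfs receive -s tank/newfs

# Or, if you decide not to finish, abort the partially received state
zfs receive -A tank/newfs
```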
And so now, as soon as you notice on the receiving side that one of the records was broken, you can fail the receive, and then you can use the token the same way as you would for a system shutdown or a crash or something, and resume the send from where you left off. Finally, if the send fails and you decide later on that you don't want to continue doing the receive, you can also abort it using the -A flag. And all of these operations are available through libzfs and libzfs_core. So if you want to use them programmatically, there are already libraries that will allow you to do that. So moving on to the next problem here. Another issue that you could kind of pull out of that last problem statement was that the user had 10 days worth of data to send. That's a lot of data. And just spending 10 days waiting for a replication to finish can be super frustrating, regardless of whether or not your network is causing failures in the middle of it. This is especially true when our customers are trying to move data from on-premise into a public cloud and the network connection that they have isn't high enough throughput. There are plenty of other examples, too. But basically the biggest use case for this is replicating over a WAN. So the way that we chose to solve this was to send the data in a compressed format. And this may not seem super exciting, right? What if, instead of doing what I did before, I just used gzip on one side to compress the data right before it got sent and gunzip on the other side? Obviously it sends less data when the data is compressed. There's no modification to the file system necessary. Like, why don't we just do that? So first off, gzip is just a pretty slow compression algorithm for the compression ratio that you get. And so you're going to be CPU bound very quickly. And it's not too hard to solve that. Maybe you just use a different compression algorithm, like LZ4, which is the one that we use most often with ZFS because it's so fast, especially decompression, but also for compression. The second problem is that gzip itself is single-threaded. And so you max out one CPU core with gzip or LZ4. That's obviously going to be your limiting factor there. So maybe what you could do is split up the stream into blocks of a fixed size, compress each one, reconstitute them before you send them out so that they're still in order, or mark them with something and then reconstitute them on the other side when they come in out of order. This will work. But the problem is that it's going to peg all your CPUs, because now you're going to be doing this compression across more data at once. And when you think about it, I mentioned at the beginning of this presentation that the file system itself is already compressed. So one thing that we were kind of thinking about when we were first implementing this was that, basically, as you pulled the data out of the file system, it was being decompressed to be put into the stream. And our initial implementation of this was using compression out in user space before we sent it over the network. So we would do a decompression coming off disk, then a compression in user space, then a decompression on the receiving side, and then another compression as we wrote it down to disk. So this is just a ton of extra work. It was burning a ton of CPU, and that was the limiting factor on a lot of lower end customer systems without a ton of CPU attached to them.
So what we decided to do instead was basically read the data as it's compressed on disk on the sending side and just put it directly into the send stream with no other processing before sending it over. And then on the receiving side, you kind of have to ignore whatever settings there are for the compression format. If the compression algorithm that has been set for the file system you're receiving into is different from what's being sent to you, you just have to take what the sending side is giving you. Otherwise, you would have to do a decompression and recompression cycle on the receiving side in order to match the algorithm that the receiving side wanted. Maybe that would be possible, but it would still be a lot of CPU burned, and most users don't actually do that; most of them use the same compression algorithm across different systems. So the beautiful thing about this is that there's actually no extra CPU time needed, and you really can just fly through your replication from here. Using it is super easy: all you do is add the --compressed flag on the sending system and it generates that compressed stream for you. On the receiving side, you have to have a version of ZFS that is able to interpret the compressed stream, but other than that, there's really nothing to it. So when we actually did the performance analysis of this, we started off with a compression ratio on an LZ4 system of 2.67x, and we got a logical send speedup of exactly what you would expect, right around there, 2.5x. But then, in addition to that, we also reduced the CPU cost a lot, by approximately the same multiplier. I want to note that the multiplier is not actually because of the compression ratio; it's really the ratio between the CPU cost of decompressing plus sending divided by just the CPU cost of sending with no decompression at all. So really big wins both in terms of CPU use and network bandwidth use.
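In practice this is just one extra flag on the sending side. A minimal sketch with the same hypothetical names (the short form -c is equivalent to --compressed):

```sh
# Send blocks in the same compressed form they are stored in on disk
# (LZ4 in this example), so neither side spends CPU recompressing.
zfs send --compressed tank/data@snap1 | ssh recv-host zfs receive backup/data

# Incremental sends work the same way; -i plus -c keeps the stream compressed.
zfs send -c -i tank/data@snap1 tank/data@snap2 | ssh recv-host zfs receive backup/data
```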
So I'm getting close to the end of my talk here, but I wanted to wrap up by talking about ZFS on Linux in particular. Resumable sends and compressed send streams are both available in the release candidates for 0.7.0. And I wanted to talk briefly about some of the other features that are landing in 0.7.0, because I think it's going to be a really cool release. First off, there's the compressed ARC, which is the RAM cache that ZFS uses for storing all of its blocks. Basically, because we're storing the data in the ARC in a compressed format, again we see a 3x improvement in the logical amount of data that you can store in the cache. So instantly, all of our users' caches got 3x larger. We do have an extra decompression step whenever the data is being accessed, but we've also implemented a much smaller decompressed cache on top of that, so that frequently accessed data is ready for the user in decompressed format and they don't get slowed down on every access. Then there's also a bunch of new cryptographic checksums being integrated that are faster than what already exists in ZFS. And there's hardware-accelerated RAIDZ parity calculation, so reconstructing your pool when one of the disks goes away is now much faster and less CPU intensive, and the same acceleration applies to checksumming algorithms too. And we've also made some really big performance wins in the block allocation algorithm that we use for storage pools that are really close to capacity. So if you're up to 90% or 95% full, you can keep allocating at a pace that is similar to what you had at 50% allocation, rather than before, where it would basically start decreasing pretty quickly after you got past 75% or 80% full on the pool. On Linux in particular, ZFS has a history of struggling to interact with the memory management layer, and in 0.7.0 there's been a really big change in the way that works, so that now they interact much better, and hopefully it should not require nearly as much extra RAM to keep the ARC from hanging the system. And finally, there's now a fault management layer built into ZFS on Linux that allows you to take programmatic actions when the system notices that a disk is going bad or you need to fail over to something else. That's pretty much it for the talk, but I want to open it up for questions.

Yeah, so the question was: do any of these new features require changing or basically recreating the pool from scratch? The answer is no. Basically, we are adding these on as feature flags on the file system. So if there's anything that changes the on-disk format, you have to explicitly enable that, but that doesn't change any of the data that's already been written into the pool. New data will be written in the new format; old data will stay in the old format. For the most part, the only thing in here that changes the on-disk format is the new cryptographic checksums. Everything else is performance improvements or stuff that's only in memory, so I don't think there's anything else in here that would be different there.

Yeah, so all these features have landed in the upstream OpenZFS repo, and FreeBSD is pulling them in at the same pace that Linux is. OpenZFS is the broader community of cross-platform ZFS support. The default master branch upstream is actually the one that's on illumos, which is the open-source fork of OpenSolaris after Oracle closed-sourced it. But all these other platforms are getting the features at basically the same rate that they integrate into illumos.

Yeah. So the question was: can we get ZFS into mainline Linux in the future? I'm not a lawyer, so I don't know. But yeah, the problem is that the licensing is different between the two, and they conflict, or some people believe that they conflict. I personally would love it if it got integrated into mainline Linux, but in the meantime, it works as a kernel module that you can install, and it works fine.

Yeah. So the question was: if you don't use any compression on the sending side, do you get anything from compressed send/receive? No. Basically, you'd be just as well off using gzip or one of the other solutions that I was talking about there. Other questions?

Yeah. Yeah, I think so. So basically: if you've automated your send/receive workflow and you try to use a resumable send/receive, but it keeps failing, will you end up with a bunch of half-finished file systems lying around? No, you shouldn't. If you try to resend stuff that has already been sent and you haven't aborted the first attempt, then it'll reject the subsequent send streams. Hopefully, that would prompt you to go in and actually fix the problem or whatever. But yeah.

Yeah. Oh, for the token. Yeah, so the question was: if the receive completes normally, what happens with the token on the receiving side? Basically, there's just no token there; there's no reason to do anything in that case. Yeah.
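To make the feature-flag answer above a bit more concrete, explicitly enabling one of the new on-disk features might look something like the following sketch. The pool name tank is hypothetical, and sha512 is used here just as an example of one of the new checksum features:

```sh
# See which features the pool knows about and whether they are
# disabled, enabled, or active.
zpool get all tank | grep feature@

# Explicitly opt in to a new on-disk feature; data already on disk is
# left alone, and only newly written data uses the new format.
zpool set feature@sha512=enabled tank

# Start actually using it, for example as the checksum on one dataset.
zfs set checksum=sha512 tank/data
```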
So the question was: how long do we leave the failed send remnants around on the receiving side? Indefinitely. As long as you don't choose to abort the receive, or resume it and complete the send, we just leave it there, and it uses space. Sorry, say the last bit again. Yeah, so can you remove it? Yeah, you can abort the receive, and that will remove all the data associated with it.

So the question was: what if your send fails and then you want to change the place that you're sending to, or sending from? Yeah, so let's say that you fail over your sending side to a secondary system that also has all of the data replicated to it. Is that kind of the example that you're talking about? Yeah. So as long as you have the data, whether it's from the pool that originally existed and you just plug it into a different host, that would be fine. The thing that, here, I'll actually go back in my slides. The thing that we need in order to receive something is basically to have the snapshot GUID match, and that's preserved across send/receives. So as long as you're trying to send from a system where the snapshot GUID is the same as it was on the original sending system, then you're good to continue sending. And that should be the case for both of the examples that you gave, too.

Yeah? Is there any way to automate the number of snapshots you want to have on a backup site? So that would be, say, if I upload 10 backups and I have a policy of only keeping the last 20, it will discard whatever else I have there. Yeah, so there are a few different user-space orchestration frameworks that will automate the backup retention for you, or will even run this whole replication workflow for you so that you don't have to do it manually. But there's nothing internal to ZFS that does that for you.

What is the feature number? I mean, when you do a zfs upgrade, it gives you a number, a version number for the features that the file system has. Which release are these features in? You know what I'm talking about? When you type zfs upgrade and you don't actually upgrade, it just gives you the file systems that you have and their release number. Yeah, so when you upgrade the set of features that are available on a zpool, the way it used to work, when Sun was the only maintainer of ZFS, was that there was a version number for the pool, and any new features in that version number would be enabled. Now that it's kind of a distributed community, we implemented this idea called feature flags, which basically allows two different companies to implement two different feature flags and have them land on different platforms at different times and so on. So there's no single number associated with the pool version anymore; it's all the same pool version now, basically, but you can enable and disable these features one at a time.

Yeah. So: we implement ZFS sends out to flat files and then end up ultimately storing them on tape as a backup. Cool, yeah. If I encounter some kind of problem, like the system goes down during a restore from a flat file, is there any option there for being able to restart the restore in that situation? Yeah, so in that case, there would be no way to restart it from halfway through. There wouldn't be a way to resume it, because there's no system that can generate the send stream again. But you would be able to abort the partially received state and then just take the send stream in again from the original source on tape or whatever. Yeah.
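Going back to the failover question above, one way to sanity-check the "snapshot GUID must match" condition is to compare the snapshot's guid property on the two candidate sending systems. This is an illustrative sketch with hypothetical names; as mentioned in the answer, the GUID is preserved across send/receive, so a replicated copy of the same snapshot reports the same value:

```sh
# On the original sending host:
zfs get -H -o value guid tank/data@snap1

# On the failed-over secondary that holds a replicated copy:
zfs get -H -o value guid tank/data@snap1

# If the two values match, it's the same snapshot as far as send/receive
# is concerned, and you can resume from the secondary with zfs send -t.
```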
I was just going to mention: if you're looking for a user-space tool to automate all of this, if you Google Zetaback, it's a tool set that basically automates the snapshots and sending them off to storage, and you can use that to do backups and all that stuff. So that's one option if you're looking for that. Yeah, very cool. Any other questions? OK, thank you very much. Thank you.