Welcome everyone to the Rook talk on storage for Kubernetes. I'm Travis Nielsen with the IBM storage team. I'm one of the original creators of Rook and one of the maintainers. It's been a great journey: we created the project and announced it seven years ago already. So happy to be here with you again today. It's a little different size of conference now than Seattle, where we announced it, when there were only a few hundred people at KubeCon. Imagine that far back. But yep, happy to be here. Annette?

Yeah, hi. I'm Annette Clewett, and I work with Travis. I met Travis about five years ago. At that time I was doing Kubernetes and storage, and Travis gave me a demo and I was all in on Rook after that. So I've been doing that with Travis and others, in particular disaster recovery right now, which I'll go over later. Thanks. Dmitry?

I'm Dmitry Mishin. I'm at the University of California, San Diego, at the San Diego Supercomputer Center. We've been using Rook since almost the beginning of Rook, and we're still using it a lot. I will talk about our use case and how excited we are about the capabilities Rook provides. Thanks.

So what are we going to talk about today? I'm going to start us off by talking about what Rook is, and talk a little bit about why you would use it and why you need storage. I think we heard a bit this morning in the keynotes about why data is important. We all know why data is important; we need to protect it. Then Dmitry will talk to us about how he's using Rook in the National Research Platform: give us some background and show us the topology of how he's deployed it. It's an interesting use case of how Rook is really used in production at large scale. And then Annette will finish up with application disaster recovery, talking about scenarios for how to protect your applications across multiple data centers and multiple regions.

Just to get an idea of who's in our audience today, I'd love to know who's here to learn about Rook for the first time. All right, we've got a good crowd. Who's heard of Ceph before? Most of you, okay. How many have experimented with Rook before? Okay, good crowd also. And have you deployed Rook in production? A lot of you also. Okay, thank you for that background.

So let's go through quickly what Rook is. Originally when we were starting with Rook, even before we took a bet on Kubernetes, we were looking at cloud native storage. What do we do for storage? What about storage in your own data center? If you're running in a cloud provider, there are options there: AWS has its solutions, and so do Google Cloud and Azure. But what about your own data center? What do you do? And then if we need storage for Kubernetes, how do we plug it in? Kubernetes has traditionally treated storage as an external thing: we'll just worry about stateless applications, and storage is an external problem. But why not manage storage with Kubernetes? Why does it have to be separate? Why not treat it as any other Kubernetes application?

So as we started our journey, we were looking at what storage platform we should trust. I didn't work for the Ceph team at the time, so it was an independent viewpoint. We wanted to choose a platform that enterprises trusted. We didn't want to build a new storage platform, because we know data is a hard problem. And so we made the decision to build on Ceph. Ceph had already been production-ready for many years. But back to Rook.
So what Rook is, then: Rook is making storage available in your cluster. We looked at Ceph and said, it wasn't built for Kubernetes, so let's bring Ceph to Kubernetes. We created an operator with custom resource definitions to define how you want to deploy Rook, how you want to deploy storage, and then Rook takes over the rest. We automate the deployment, configuration, and upgrades, and allow your apps to consume the storage just like any other storage in Kubernetes. We use storage classes, persistent volume claims, all these other terms that you're familiar with in Kubernetes. From the start, we wanted to make sure Rook was open source and would do the right thing for the community. So that is one of the basic principles: open source and open to contributions.

Well then, what is Ceph? What did we love about Ceph that made us really want to choose it? Ceph is a distributed, software-defined storage solution, and it provides all three common types of storage: block, which is used for ReadWriteOnce volumes; shared file systems, with ReadWriteMany volumes, so if you have several pods, several applications, or multiple instances of an application that need to share the same volume, you can use the shared file system with CephFS; and then object storage, so if you're not running in the cloud, or don't have access to the cloud, and you want your own cloud with an object store, you've got access to S3 buckets locally with Ceph as well. More information is on the Ceph website, ceph.io. Another thing we loved about Ceph is that it's also purely open source and open to contributions. It has a proven history with enterprise adoption, with its first release back in July 2012, so over 11 years ago, and there's a great story with CERN's Large Hadron Collider, where they're using Ceph for huge data processing needs. Ceph itself is designed to be consistent. It's not eventually consistent: once you commit your data, you know it's committed and replicated. Ceph has a great architecture for replication across different AZs or whatever topologies you have, racks, nodes, disks. We can take your storage in the topology you have and create the storage platform. That replication is configurable: how many replicas do you want? And it's proven highly durable. Even in extreme disasters, data can be recovered with troubleshooting guides.

So what does this look like as we brought it together in Rook? Architecturally, we really have three layers to think about. Rook is the management layer: with the operator and the CRDs, you tell us how you want Ceph configured and then Rook manages it. Then the plugin layer is CSI. There are CSI plugins for many storage platforms; Ceph CSI manages the provisioning and the mounting of the storage to your application pods. And once your data is provisioned and mounted, under the covers Ceph provides the pure data layer. At the end of the day, Ceph doesn't even know it's running in Kubernetes; it's just providing that data layer as if it were running outside of Kubernetes. But it can all run together, or it can run separately.

Moving on, how do you install Rook? There are multiple ways. You can use Helm charts, and we also have example manifests for all sorts of different configurations. There are many ways you can choose whether you want to run all three storage types, block, file, and object, or just one, or any combination of them. You can get started on rook.io.
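To make that concrete, here is a minimal sketch of what those custom resources can look like: a CephCluster CR that tells the Rook operator how to deploy Ceph, plus a block pool and a StorageClass that apps then claim volumes from with ordinary PVCs. The names and version are illustrative defaults along the lines of the Rook example manifests, not anything specific to this talk, and the full StorageClass example in the Rook repo carries additional CSI secret and image parameters omitted here.

```yaml
# Minimal sketch (illustrative values): tell the Rook operator how to deploy Ceph.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.0   # a supported Ceph container image
  dataDirHostPath: /var/lib/rook       # where Rook keeps config on each host
  mon:
    count: 3                           # three monitors for quorum
  storage:
    useAllNodes: true                  # consume devices on every node
    useAllDevices: true
---
# A replicated block pool backing the StorageClass below.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  replicated:
    size: 3                            # keep three replicas of the data
---
# Apps consume the pool through a normal StorageClass and PVCs.
# (The full Rook example adds CSI secret and image parameters not shown here.)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com  # the Ceph CSI RBD driver
parameters:
  clusterID: rook-ceph
  pool: replicapool
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
```

From there, a workload just creates a PersistentVolumeClaim against that StorageClass and Ceph CSI provisions and mounts the volume, which is the "consume it like any other Kubernetes storage" flow described above.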
Now, where can you run Rook? Anywhere Kubernetes runs; that's our goal for running storage. Whether you're in the cloud or on-prem: if you're on-prem, you clearly have this need for storage, and that's where we started the project, but users have found uses for Rook even in the cloud, to have that consistent platform for various reasons. You can run on virtual or bare metal hardware. The underlying storage can be node-attached devices, PVs from the cloud, or loopback devices for testing. And Rook really helps enable cross-cloud support, so you have that consistent data platform running in your own data center or in any cloud. We also have a mode where you can run Ceph externally: essentially you configure the CSI driver to just connect to a Ceph cluster you've already got running outside of your cluster, so you don't have to redeploy it inside Kubernetes.

One feature I'll just mention that's under active development is object storage provisioning. With the Kubernetes community, we're working through the Container Object Storage Interface (COSI), which is for provisioning buckets with object storage. We do have that implemented in the latest release, in experimental mode, and we're happy to hear your feedback. Until that's more stable, we have object bucket claims, which we've been using for several years already to provide that bucket provisioning.

A little about the project and the health of the community. Our philosophy has always been community first: we want to know what the community wants for storage, and we want to make sure it stays open source. We have maintainers across several companies currently, with Cybozu, IBM, Red Hat, Koor, and Upbound. Annette and I are with the Ceph team; we moved from Red Hat to IBM with the acquisition, things going on, but it's all the same Ceph team and Rook working together. We've had over 400 contributors to the GitHub project, and just this week we hit the milestone of 300 million container downloads, which is kind of exciting. Rook graduated with the CNCF three years ago, in October 2020. It's a testament to how much the community appreciates that openness and that community-first approach, and to running in production for a long time now. A little bit about that journey: three years since graduation, five years since we declared it stable for production, and seven years since we announced it. So many people are running upstream; we never even know how many, and we always love to hear your stories about that. And there are several companies with downstream products around it as well that we're not going to talk about today.

Release cycle: upstream, when do we release? We shoot for about every four months, similar to the Kubernetes cycle. 1.12 was in July; 1.13, with holidays and things coming up, moved to early December, a little over four months in this case. But we do have regular patch releases, where we shoot for bi-weekly unless there's a critical need, and then we can ship as soon as we need to. We try to keep our CI processes streamlined so we can release whenever needed. And now we'll pass the torch to Dmitry for the National Research Platform.

Hello, everyone. I'm going to talk about our use case for Rook. I'm part of the team of the National Research Platform. The National Research Platform is an NSF-funded project that provides compute resources to scientists from more than 15 institutions.
Mostly in the US, but we also have collaborators in Europe and Asia, and it's based at the University of California, San Diego. The project started from a project called PRP, the Pacific Research Platform. It was mostly measuring network performance between universities: they are connected with 10 to 100 gigabit networks, and it was making sure that you really get the network performance you're expecting. But we put Kubernetes on those nodes just to handle them more easily, and that's how the Nautilus cluster was born. Nautilus, in this case, is the name of the cluster, of the Kubernetes cluster, not the Ceph release. Eventually, last year, the project evolved into the National Research Platform, which is an NSF testbed for national infrastructure for computation. What it does is allow different universities to attach their nodes to a single cluster, and we also provide all those resources for free to scientists who have research projects. That makes it a global Kubernetes cluster. As I said, we have nodes in the US, Europe, and Asia; Africa is not covered yet. And all nodes are connected with 10 to 100 gigabit Science DMZs, so no firewalls, jumbo frames, very well connected, all monitored.

Because we provide compute resources, we need persistent storage, and we have been using Rook from the beginning for all persistent storage needs in our cluster. Currently, that includes six local and regional Ceph clusters inside one Kubernetes cluster, and I will talk about them more. And because Kubernetes provides connectivity between all nodes, so each node sees every other node as a next hop, all nodes around the world can mount any Ceph pool, which is pretty cool. If you mount from far away you get less performance, but it's possible to mount. We don't have the problem of needing to move data first and then do your computation.

Resource providers can add their own resources, which means they tell us, hey, I have this node connected to a Science DMZ. They provide maintenance, power, cooling, and networking. If something breaks, we expect them to fix the hardware, so it's remote hands, but we take it from there: we install the operating system, we manage all the software on it, and the node becomes part of the cluster, gets all the monitoring, and gets jobs from users. Optionally, they can request preferential access to their own hardware, but that's not required to use our compute resources. Scientists can just come and say, hey, I want to run in your cluster; they are more than welcome to. It takes five minutes for a node with Ubuntu installed to become part of the cluster, with just an Ansible playbook.

This is a map of the GPU distribution around the US. We're covering a lot of EPSCoR institutions and minority-serving institutions. Most of the GPUs we have currently are in California and on the West Coast, but we have good representation in the central states and on the East Coast. The cluster is constantly growing; we are adding a couple of nodes per month, and currently we have 19,000 CPU cores and 1,200 GPUs of different generations, from the oldest 1080 Tis to the newest A100s and so on. And here is the map of Ceph storage. Currently it's five petabytes, again scattered across the US in six Ceph pools, and it covers many regional networks. CENIC provides the network in California, and Internet2 covers most of the US, so everything is very well connected with fast networks. And here's the map of our nodes. Again, as I said, California is historically the biggest; we have many nodes in the central states.
This is showing memory, CPU, and GPU with the size of the dot, and the number of nodes. We also have three nodes in Europe and I think five nodes in the Pacific region. Most nodes in California were bought with NSF grants, and the other nodes were partially donated by different universities and partially purchased by the NRP project.

This is the distribution of sizes, capacity, and usage of our Ceph clusters. The first one is the largest. It's historically just called Rook, but that's the Western Ceph pool. This slide shows two petabytes; this week we added another petabyte to it, so now it's actually three petabytes, with 1.5 petabytes used. That was after heavy purging, because we had some capacity issues in it, and it's the pool most used by users. The second one is Eastern, covering New York, New Jersey, and Delaware; its usage is growing now, and it's one petabyte of capacity. The next one is the central states, also close to one petabyte, with 600 terabytes used. Then Southeast Florida, our newest one, so it's not used much yet, but again usage is growing. Then Pacific, almost 400 terabytes. And the last one is the smallest, local to UCSD; that one is NVMe-only, while the other pools are spinning drives with the database on NVMe. Yellow is above 70% used.

This is our dashboard for the largest, Western pool. As I said, it's above three petabytes right now, with 1.7 petabytes used. We usually see between 5,000 and 10,000 IOPS, peaking up to 15,000 IOPS sometimes, and the pool can deliver up to 10 gigabytes per second. We try to keep all the pools below the 10 millisecond latency range between nodes, just because that's the requirement of Ceph and that makes the pool faster. But users, as I said, can mount it from far away, and Ceph caches the data, so they get less performance, but it's still usable even if you mount from across the country.

We also experiment with other storage in our cluster. This is the diagram of all the storage in the cluster. The first and biggest is Western. Then we have three OSG (Open Science Grid) origins; that's the project working with CERN, mostly on high-energy physics, and those are just three nodes with a petabyte of storage attached, using XRootD and StashCache caches to access that data. So that's the next three petabytes. Then again, a number of Ceph and Rook pools. And then there are LINSTOR and CVMFS storages; those are small and just used as experiments and for very small use cases. The majority of users are still using Rook and Ceph, and that satisfies all their needs. Thank you, Dmitry.

Hi, everyone. I want to switch gears just a little bit here and talk about a solution I've been working on for the last couple of years, which is to take Rook, which orchestrates and manages Ceph, and combine it with a few other upstream projects to automate application disaster recovery on Kubernetes. What we're talking about is a situation that could be a true, actual disaster. I was with a financial company when Hurricane Sandy hit New Jersey and New York, and the company I was with figured out they had way too many data centers within the circumference of that disaster; some of the data centers were unavailable for months. Sometimes when you have a disaster you can communicate with what you need to, and sometimes you can't. So let me just give you a visual. A region doesn't necessarily have to be that far away, but the main thing here is that this solution is asynchronous.
So we'll see that you're going to have some data loss, but you can cap that data loss based on how often you move data from one region to the other. The applications only exist on one cluster at a time. The data is replicated from one cluster to the other, but until you need to use the application on the opposite cluster, it doesn't exist there, so you're not using resources just waiting for a disaster. Both clusters can be used, but any one application only exists on one at a time.

Again, disaster recovery and resiliency are not new concepts; they've been around for a long time. I think the difference with containerized platforms is we don't see a lot of it yet, but as containerized platforms become more critical, they are going to be required to have disaster recovery planning and resiliency. We have two measures, recovery point objective and recovery time objective, again not new, but what we want to do with containerized platforms is get those down to minutes instead of days. In the case of RPO, for asynchronous replication, your replication interval really decides how much data can be outstanding. So if you're replicating every five minutes, then you could lose up to five minutes of data. For RTO, even if you lose zero data, you're still probably going to have application downtime, because you have to reinstall the application on the alternate cluster.

Rook has been really instrumental in helping this solution along. One of the Rook CRs, or CRDs, is the CephRBDMirror. Ceph has had, for quite a long time, the ability to do mirroring, which is replication between Ceph clusters. So this is not one cluster, this is two clusters, and we're replicating the data via snapshots from one to the other, based on volumes or images. That piece is coming out of Rook. And then out of CSI-Addons, we have the VolumeReplication CR and the VolumeReplicationClass. Those are really important to the solution, because VolumeReplication is what we use to enable and disable mirroring, instead of having to do it with Ceph commands, and the VolumeReplicationClass holds the interval of replication. So those two come out of CSI-Addons.

Another thing we need to make the solution work, again coming from CSI, is a volume replication operator and an OMAP generator. These are sidecars. If you've ever deployed Rook, you'll see there are CSI pods that get created; one of them is called the RBD provisioner, and these two sidecars become containers within that RBD provisioner pod. You can enable them, if you're doing it manually, via the config map for Rook Ceph. So if we're doing this using Rook Ceph and the capabilities there, we're going to be able to change the state of the VolumeReplication so that it will promote and demote. Actually, let me restate that: you will be able to enable mirroring, and then what promotes and demotes the storage is the volume replication operator we saw from CSI-Addons. If you think about it, if I have a volume that is using an image, and that image is replicated over to an alternate cluster, I have to be able to demote the storage on one cluster, if I have access to it, and promote it on the other so that it can be used. So application failover uses the custom resources I just went through.
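As a rough sketch of those pieces, assuming a Rook-managed cluster with the CSI-Addons sidecars enabled, the CRs look something like the following. The interval, secret names, and PVC name are illustrative placeholders; the exact parameters should be checked against the Rook and csi-addons documentation.

```yaml
# Sketch: the Rook CR that runs the rbd-mirror daemon for cross-cluster replication.
apiVersion: ceph.rook.io/v1
kind: CephRBDMirror
metadata:
  name: my-rbd-mirror
  namespace: rook-ceph
spec:
  count: 1                          # number of rbd-mirror daemon pods
---
# Sketch: a VolumeReplicationClass from CSI-Addons; the interval effectively caps RPO.
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass
spec:
  provisioner: rook-ceph.rbd.csi.ceph.com
  parameters:
    mirroringMode: snapshot         # snapshot-based mirroring between the two clusters
    schedulingInterval: "5m"        # replicate outstanding data every five minutes
    replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
    replication.storage.openshift.io/replication-secret-namespace: rook-ceph
---
# Sketch: per-PVC replication state, "primary" on the active side, "secondary" on the peer.
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  name: my-app-data-replication
spec:
  volumeReplicationClass: rbd-volumereplicationclass
  replicationState: primary         # demote this side by changing it to secondary
  dataSource:
    kind: PersistentVolumeClaim
    name: my-app-data               # placeholder PVC name
```

Flipping replicationState between primary and secondary is the promote and demote step described above, handled by the volume replication operator rather than by running Ceph commands directly.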
There's one case that I think doesn't get enough attention, which is the second one: what if I just want to migrate the application to a different cluster, because it's closer to the users, or maybe the cluster I have is out of resources? In that case, you can scale the application down and sync the replication, all the outstanding data, and therefore your RPO is equal to zero. It does require, though, that both clusters are healthy if you're just doing a migration. In the case of disaster recovery, it doesn't require that: one of the clusters could be, like I said with Hurricane Sandy, not communicating, and you'd still be able to recover on an alternate cluster, because the image, the persistent data, is on the alternate cluster.

What I want to go to now is the solution part of this, which is how we can combine a couple of other open source projects. Open Cluster Management, I don't know if you've heard of it, is in the CNCF process. What it's good at is application lifecycle, in particular deploying and scheduling apps if they are available from a Git source. So we can use Helm charts, we can use Kustomize, but basically this allows you to schedule an application via OCM, and we're going to combine that so we can automate the creation of the application on the initial primary cluster and then, if needed, recreate it on a secondary cluster.

Really important glue for all of this is the RamenDR project. That's the one I'm involved with. We use OCM, we use all the custom resources that you saw, but in addition there are some new CRs coming out of this project. The first is the DRPolicy. With this solution, everything you do is in groups of two: even if I had 100 Kubernetes clusters, I'm going to divide them into 50 peers, so each cluster has a failover cluster. The DRPolicy defines which two clusters are going to protect each other. The DRClusters are then sort of an outcome of that, defining which clusters they are; those are cluster-scoped custom resources. And then we have the DRPlacementControl. It basically defines the action, whether it's a failover or a planned maintenance migration. It lives on the hub cluster, which we'll see in a minute, and it defines what we're going to do. Another really important custom resource coming out of the Ramen project is the VolumeReplicationGroup. That one is actually created on what we call the managed cluster: once a DRPlacementControl is created, an associated VolumeReplicationGroup is created where the application actually lives.

We saw this at the beginning, but I just want to define it a little bit more here. You see in the middle there the Open Cluster Management hub. This takes three Kubernetes clusters. One of them is the hub, and that is why we don't need the clusters to be communicating with each other: the hub is where the DRPlacementControl decides the action. So say the cluster on the left is no longer communicating, and say there are a hundred applications there. We can use the hub to move those applications, to recreate them on the surviving cluster. And once the other cluster comes back online, we can move them back using a planned migration.
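A rough sketch of what those hub-side RamenDR resources can look like, loosely following the upstream samples: the cluster names, namespace, placement kind, and label selector are all placeholders, and the exact fields should be checked against the RamenDR docs.

```yaml
# Sketch: a DRPolicy pairing two managed clusters with an async replication interval.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPolicy
metadata:
  name: dr-policy-5m
spec:
  drClusters:
    - cluster-east                 # placeholder cluster names
    - cluster-west
  schedulingInterval: 5m           # how often outstanding data is replicated (caps RPO)
---
# Sketch: a DRPlacementControl on the hub that protects one application's PVCs.
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: my-app-drpc
  namespace: my-app                # placeholder application namespace
spec:
  drPolicyRef:
    name: dr-policy-5m
  placementRef:
    kind: Placement                # the OCM placement that schedules the app
    name: my-app-placement
  pvcSelector:
    matchLabels:
      app: my-app                  # which PVCs belong to this application
  preferredCluster: cluster-east
  # For a failover or planned relocation, the action and target are set here, e.g.:
  # action: Failover
  # failoverCluster: cluster-west
```

Setting the action on the hub is what drives the whole flow: the VolumeReplicationGroup on the managed cluster then demotes or promotes the underlying images, and OCM redeploys the application from its Git source on the chosen cluster.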
So it's really powerful that you have the hub, and Open Cluster Management, remember, has deployed the application from a Git source, so it knows how to redeploy the application once the storage is promoted on the alternate cluster. And there are some other things there that you can read, but the whole point is we want low data loss, we're talking minutes, and we want the time for the application to be reinstalled to be as low as possible.

If you're interested in this idea: because three clusters is a bit of a heavy lift, especially if you have control nodes and compute nodes in every one of them, the team I'm on came up with a way to use a single VM or machine that you have, with 8 CPUs and 32 gigs, still a heavy lift, running some kind of Linux; we've been testing on Fedora. With this, you can go to the RamenDR GitHub and go to the docs. I recently rewrote the quick start guide so that hopefully there aren't any missing steps, and I also did a video, that's in blue there. What's nice is it uses minikube and it sets up the entire environment. Let me just show you: it sets up this entire environment, sets up the Ceph mirroring, creates the three clusters, and sets up, in the middle there, an S3 bucket. The application can be redeployed, but what we absolutely have to make sure is that the PV and PVC use exactly the same name and the same definition. So we move that definition to an S3 bucket, so that on the alternate cluster it can get that data and make sure everything is hooked up correctly. So feel free to try it yourself. Once you get the prerequisites done, like installing Podman and some other things, you can basically use the drenv tool, run the start command against that YAML file, and it creates the whole environment for you. So I think we're at questions. Thank you.

Yes, thanks, Dmitry and Annette. I think we're about out of time, but we'd love to see you at the Rook booth in the Project Pavilion. It's way at the end, next to the CNCF store. We'll be there for a few hours this afternoon and then tomorrow, just for the first half of the time in the booth area. But if there are any burning questions, we could do them here in a few minutes, if anybody has anything. One question? Okay.

I see in your slides you're replicating the volumes as well as replicating to the S3 object storage, so I'm just wondering what kind of consistency guarantees the system allows.

In terms of consistency, right? Is that what you asked about?

Yeah, so data loss is one thing, but what about having the volumes or the storage in an inconsistent state?

Well, yeah, it'll be crash consistent, but it won't be application consistent. If you have a database, you could lose transactions. It's using snapshot technology, so it takes a snapshot and then transfers the snapshot. You could possibly have hooks that would quiesce the application, and we're working on looking at that, but the hooks are pretty much per application.

Yeah, because an application might need a whole set of files to be there in order to see it as consistent.
Yeah, right now we're using what we call declarative: you declare it in a Git source, so when you recreate the application, it's all declarative. We are looking at what we're calling imperative, which is where the application is not in a Git source but you still want to use the same process, and maybe that's more of what you're talking about. Yeah, sounds good, we'll catch up. All right, thanks. All right, thank you. Well, thanks everyone. Again, feel free to come up for questions, or we'll be at the Rook booth. Okay, thank you.