Hello, DevConf. Welcome to the last day of the conference. I know it's late in the afternoon, well, early, but thank you for joining. I know some of you have flights to catch, so I appreciate you being here, hungover or otherwise. Let's get started. Today we'll be doing a presentation on raw block persistent volumes in Kubernetes and how they help change the way the Rook project deploys in Kubernetes. A fair warning: this presentation may include opinion, speculation, and bad jokes on our part. They are strictly our fault, and Red Hat will not be held liable for anything we say here today.

All right, introductions. My name is Jose. I've been at Red Hat for about seven and a half years now, working in storage the entire time. Currently I work on OpenShift Container Storage with a focus on Rook, and I'm the project lead for the OCS operator, which will not be discussed in this presentation since it does not pertain to Rook directly. I also participate in SIG Storage in upstream Kubernetes. And I like hitting things, mostly drums, though. Here we have Rohan Gupta. He graduated from college in 2018 and did a Google Summer of Code with the CNCF to add the NFS operator to Rook. We liked that so much, we decided to keep him at Red Hat, so he's currently working on Rook inside of OCS as well. He likes watching anime and riding motorbikes.

All right, here's what we're going to cover today. We're going to start by setting the stage and giving you an overview of storage in Kubernetes, what raw block PVs are, and a quick rundown of Rook and Ceph. We'll go into the details of what Rook and Ceph used to do in Kubernetes before raw block PVs came in and afterwards, and some hurdles we encountered in the implementation. And with luck, and network permitting, we're going to host a live demo. Those are always fun.

All right, to set the stage: storage in Kubernetes. When working in Kubernetes, you deal with resources, resources defined against an API. In this case, the core of what we work on are called persistent volumes, which are a representation of some sort of storage volume in a storage system somewhere. The actual API doesn't entirely care about the details; it simply presents enough information so that the volume can be consumed by applications running in Kubernetes. And of course, different storage backends define what a volume even means. Then you have persistent volume claims. These are the things that are actually created to be consumed by the application. So an application creates a persistent volume claim that then gets bound to a persistent volume in the back end, and those two bound together represent the connection from the storage to the application. And finally, we have storage classes, which are an object that allows you to dynamically request a volume of storage from some back end. So the way that usually works is that you create a PVC against a particular storage class, which will then, if appropriate, bind a PV to your PVC for you without you having to specify a specific volume. This is generally the ideal way for cloud native applications to request storage, since you don't necessarily want to tie your data to any particular locality or identity. You want to be able to just spin up instances of whatever the hell you're working on and be able to use the same data wherever it comes up. And here's the flow I just described. The developer, in this case, submits a persistent volume claim against a storage class.
And the operations person or administrator is what sets up the storage class, which then instructs the storage back end to create the persistent volume, which is what actually gets mounted by the application pods. This entire process is called dynamic provisioning. The alternative is what's now known as static provisioning, in which the administrator doesn't create a storage class, but instead manually creates individual persistent volumes, and those then get bound to PVCs. Let me just say, the whole dynamic provisioning thing was great. It came around in Kubernetes before I actually joined the project, and I cannot imagine a world without it. I'm sure Jan over here can remember those horrible dark ages, and I'm so glad I was not around for that.

All right, raw block PVs. I keep talking about PVs, and notice that I said mounted by the application. The thing about PVs up to this point is that they were always presented to a pod as a file system. So regardless of what your back end was, whatever device you were given had a file system on it, and that's what was mounted into the pod. Raw block PVs are a new mode of PV that allows Kubernetes to present storage to containers without a file system, so you get access to a block device rather than a file system. There are a lot of applications, such as databases like MongoDB or Cassandra, that can natively use raw block devices without the need for a file system. And in particular, when you remove the layer of the file system, certain storage providers can deliver more consistent I/O performance and lower latency, since you don't have the abstraction of the file system in between. I've provided a link there to the concept documentation for raw block PVs if you're interested.

So to make this work, Kubernetes introduced a new field into the persistent volume spec called volumeMode. It's been in beta since Kubernetes 1.13. Jan, this went GA, right? No? Not yet. Still beta? All right. We're up to 1.18 now, right? Yep, still beta. But we'll get there. So this field specifies whether the volume is in mode Block or mode Filesystem. Both the PV and the PVC have this field, and they must match. So when you're doing dynamic provisioning, you create a PVC with volumeMode Block, and that must go to a storage class that supports volumeMode Block, in which case it will provision a PV for you that is also volumeMode Block. The default is volumeMode Filesystem, which was chosen to preserve backwards compatibility. Here we see the YAML for the relevant specs. In this case, you have your persistent volume with volumeMode Filesystem, which you can freely omit since it's the default. And there's the field again. And then, I'm going to stay here because the recording microphone is here. Over there, you see a pod spec where you define the volume at the bottom from that PVC, and then you mount it at a file system mount path. That's the way PVs and PVCs have worked before. With volumeMode Block, you change this field and that field to Block. You still specify the PVC down there under volumes, but now there's a new field, volumeDevices, where you specify the name of the PVC and a device path where the device will be presented. And that will be just a device file, not a file system. Any questions on this so far? All right. Oh, got one. Doesn't matter.
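Since the slide YAML isn't captured in the transcript, here is a minimal sketch of the kind of manifests being described: a PVC requesting volumeMode Block, and a pod consuming it through volumeDevices. The names, image, and storage class are placeholders, not the exact slide content.

```yaml
# A PVC requesting a raw block volume (storage class name is a placeholder)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block          # instead of the default "Filesystem"
  storageClassName: gp2      # any class whose provisioner supports block mode
  resources:
    requests:
      storage: 10Gi
---
# A pod consuming the PVC as a device rather than a mounted file system
apiVersion: v1
kind: Pod
metadata:
  name: block-consumer
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
      volumeDevices:               # volumeDevices, not volumeMounts
        - name: data
          devicePath: /dev/xvda    # where the device node appears in the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: block-pvc
```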
So the point of the device path is that it guarantees a consistent name for the same device regardless of how it shows up on the host. I forget the specifics of the implementation, but the device gets attached to the node and is then presented, via symlinks, under the given name inside the container.

Now, a quick note. To those of you familiar at all with Kubernetes: volume mode and access mode are different things. There was a lot of confusion about this when volume mode was introduced. Access modes are things like RWX or RWO, and they define how many pods may attach a PVC at a given time and whether or not they can write to it. So for instance, ReadWriteOnce (RWO) means that only one pod can attach a given PVC for read and write operations. RWX stands for ReadWriteMany, where multiple pods can attach the same PVC with read and write access. Oh, what's the other one? ROX? What's the one with multiple readers? Is that ROX? ReadOnlyMany, there we go, yes. That's where multiple pods can read the same PVC but none can write. Real quick, any further questions on Kubernetes? We're about to jump into the application layer.

Rook and Rook-Ceph. All right, what is Rook? Rook is storage operators for Kubernetes. They help automate the deployment, bootstrapping, configuration, and upgrading of your storage systems. In particular, this follows the operator pattern for storage solutions. If you've heard about operators during this weekend at all, or beforehand, an operator is basically an application that takes operational knowledge typically handled by a human administrator and automates it in an application running natively in Kubernetes. And this follows the general Kubernetes model of desired state versus observed state. Your administrator defines a desired state of what they'd like their storage system to look like, and the operator takes care of executing operations to make the actual state of the system match the desired state. This is what we call reconciliation of the actual state, and it's just a constantly running while loop: check a piece of the state, does it match? Yes. Check the next piece, does it match? No. Make a change, restart the loop, wait for it to change. Eventual consistency, I believe, is the term for that.

Now Rook, as a project, hosts multiple operators. The one we care about today is Rook-Ceph. Rook-Ceph brings Ceph into containers in a cloud native fashion. Ceph, for those of you that don't know, just the five-second pitch: it provides resilient, distributed storage in a self-healing manner. It is highly scalable and can run on commodity hardware. That's not to say it can't take advantage of more advanced storage systems, but it can run on a wider array of hardware. And it's fully open source. The main entity in a Rook-Ceph cluster is the CephCluster resource. This defines the desired configuration of a Ceph cluster, which then informs the Rook-Ceph operator to configure a system that matches this spec. There are a number of daemons that are part of the Ceph system, running containerized in individual pods. I'm not going to go over all of them right now. The ones you really need to care about are up there in the upper right, called OSDs. OSD stands for Object Storage Daemon. I believe it does. I use the acronym so much I don't entirely remember what it means.
And those are the daemons that represent the storage devices Ceph uses to store its data, which it then serves through Ceph volumes.

OSDs, then and now. Traditionally in Rook, you had local storage OSDs, meaning that Rook would directly access the devices on the host. So you had to rely on hosts that already had storage attached to them, and then you could either manually define which specific devices you wanted to use or have them automatically discovered by Rook and then consumed by OSDs. In this case, your CephCluster would define your storage nodes via names, labels, or just say use all nodes in the Kubernetes cluster. As I mentioned, you can define local devices either manually or through auto discovery, and then Rook takes care of formatting the devices appropriately, spinning up OSD pods, and consuming the devices for Ceph data. The thing about this is that Rook was designed with this as its primary paradigm when it was first being built, so they made it really easy to configure. If you just want to use whatever you can find in the cluster, the CephCluster definition is really short. And in particular, it's familiar to people who build traditional storage systems outside of Kubernetes. If you're not dealing with all this cloud native whatchamacallit, you would typically statically define all the devices you're using and import them into your storage cluster. And in doing so, this supports any type of device or appliance that is supported by Linux, because all we need is a device present in the Linux file system. The downsides, especially when considering a cloud native environment, are that this relies on specialized nodes, and there is a rigid coupling between your compute and your storage, which some people may not want, especially if they want a more robust storage solution.

Enter storage class device sets. Yes, that's a mouthful. Yes, I did design it. No, I don't entirely like the name, but community consensus is what it is. In this case, you still define your storage nodes via names, labels, or all, but instead of defining individual devices or selecting all devices, you define your desired amount of storage. This is a different way to think about your storage system, because you no longer specify "I want these X devices across these Y nodes." You just say, "I want X amount of storage across all the nodes I have available," for instance. And in much the same way, Rook takes care of the actual automation to discover or acquire the devices, prepare them, and start an OSD pod for each device. This is a typical configuration of a storage class device set. They were designed as a generic struct within Rook, so if you go look at the spec, you'll find some fields that don't entirely make sense for Rook-Ceph and are not used by Rook-Ceph. Currently, Rook-Ceph is the only operator making use of storage class device sets, but it's open for other operators to use if they so wish. Going down some of the individual fields: you need to specify a name so that you have unique and consistent names for the PVCs that get created. Then you have a count that specifies how many devices you want in the set. And you need to mark, with the portable flag, whether your PVCs are allowed to move between nodes or not.
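A rough sketch of what such a device set looks like inside the CephCluster spec, based on the fields just described. The values here are illustrative placeholders, not the exact slide content; check the Rook documentation for the authoritative shape.

```yaml
# Excerpt of a CephCluster spec using storageClassDeviceSets.
# Names, counts, and the storage class are illustrative placeholders.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  storage:
    storageClassDeviceSets:
      - name: set1             # gives the generated PVCs consistent names
        count: 3               # number of devices (and thus OSDs) in the set
        portable: true         # whether the PVCs may move between nodes
        volumeClaimTemplates:  # a list, but only one template is supported today
          - spec:
              accessModes:
                - ReadWriteOnce      # one pod per PVC at a time
              volumeMode: Block      # Ceph wants a raw block device
              storageClassName: gp2  # must exist ahead of time
              resources:
                requests:
                  storage: 100Gi
```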
You would think that this portable flag would not be necessary, but in order to make storage class device sets more flexible, allowing them to not be portable makes them a lot more useful in more environments, rather than just the narrow scope of dynamically provisioned cloud environments. And then the volume claim templates are just a list of PVC specs. I say it's a list, but currently only one is supported in Rook. We made it a list to leave room for more advanced features down the line, but for now you can only specify one volume claim spec, and that spec is used as the template for the count of devices that you want. As you can see here, we're using volume mode Block. This is required, of course, because Ceph wants to use a block device. We specify the amount of storage, the storage class name, and the access mode that I mentioned earlier. In this case, we want ReadWriteOnce because we only ever want one pod using any given PVC at a time.

The pros of this. Naturally, since I designed this, I'm fairly biased. You can offload device distribution to the Kubernetes scheduler, so you no longer have to worry about making sure your storage devices are distributed in a balanced fashion. If you have the correct parameters specified, the Kubernetes scheduler will automatically distribute your OSD pods in a good enough distribution for redundancy. In particular, this gives you the ability to migrate devices between nodes, so that if a compute node goes down and it had any OSD pods on it, those OSD pods can just migrate to any other available node and continue running, ideally without any downtime to your data I/O. And it works with any raw block PV, regardless of the back end driver. And of course, it's shiny and new. The downsides: it requires predefined storage classes, so you must have appropriately configured storage classes ahead of time before configuring your storage class device sets. Device support is now limited to whatever Kubernetes supports, so it's not quite as broad as what's available on a native Linux system. It's not as easy to configure, since the spec is now longer than just three lines. And it's new and different, so that might put some people off.

And now we're going to go over some of the issues we ran into in our implementation. As developers, when we write code, we just want it to work out of the box. We write the code, we run it, we expect it to work as it should. But that never happens, right? So this is the first issue we faced. OSD pods run as privileged pods, and in privileged pods, /dev is mounted differently, so Kubernetes ends up not presenting the block device inside the pod. So we came up with a solution: we had an emptyDir, we mounted the block device into an init container, and that emptyDir was also mounted in the privileged pod. If we look here, on the bottom right you see the init container with the emptyDir and the block device presented at its device path, and on the left-hand side you see that the privileged pod has just the emptyDir mounted. What the init container does is copy the block device node into the emptyDir. In Linux, everything is a file, right? So it gets copied there, and we have the block device in the privileged container. Any questions on this one?
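A minimal sketch of that workaround as I understand it; the container names, images, and paths are hypothetical, not Rook's actual pod spec. An unprivileged init container receives the raw block device via volumeDevices and copies the device node, preserving its major and minor numbers, into an emptyDir that the privileged container shares.

```yaml
# Sketch of the /dev workaround; names, images, and paths are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: osd-example
spec:
  initContainers:
    - name: copy-device
      image: registry.example.com/tools:latest
      # cp -a preserves the device node (type, major/minor numbers), so the
      # "file" that lands in the emptyDir is still a usable block device
      command: ["cp", "-a", "/dev/xvda", "/mnt/set1-0-data/"]
      volumeDevices:
        - name: set1-0-data
          devicePath: /dev/xvda
      volumeMounts:
        - name: bridge
          mountPath: /mnt/set1-0-data
  containers:
    - name: osd
      image: registry.example.com/ceph:latest
      securityContext:
        privileged: true        # privileged, so it sees only the emptyDir copy
      volumeMounts:
        - name: bridge
          mountPath: /mnt/set1-0-data
  volumes:
    - name: set1-0-data
      persistentVolumeClaim:
        claimName: set1-0-data  # PVC generated from the device set template
    - name: bridge
      emptyDir: {}
```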
The next problem we faced was when we were running multiple OSDs on the same node. Rook-Ceph uses LVM, and /dev was mounted inside the privileged container, so we had all the host devices, /dev/sda, /dev/sdb, while the block volume is presented as a loopback device inside the pod. What was happening is that LVM was picking up both the original device, /dev/sda or /dev/sdb, and the loopback device, and it was confused about which one to use. The solution we found was to pass the logical volume path to the Ceph OSD start command, that is, /dev/<volume group name>/<LV name>.

The third issue we faced was proper distribution. When we were running multiple OSDs, say we had three nodes, the OSDs were coming up on only one node. If the three replicas of some data are on the same node and that node goes down, we lose our data, right? For this, the solution was placement anti-affinity. Here you can see the standard pod anti-affinity we apply to our pods.

So now it's demo time, and here is a cat pic, which is praying that it works. This would work on a plain Kubernetes cluster, but because of our internal automation, we're using an OpenShift cluster here. We can see that three OSDs are running. Two OSDs are on the same node and one is on triple-two; we're not applying any pod anti-affinity, which is why two OSDs came up on the same node. And if you know about ceph status, it checks the health of the cluster: we have three mons, a manager, and three OSDs which are up and in. We have four worker nodes here. In OpenShift, we use something called MachineSets, which makes it easy for us to add or delete nodes. So what we're going to do is delete a node, and we'll see if the OSDs come back up without any loss. In the normal case, when a node goes down, you'd have to take the device out of that node and put it into another one, and Ceph would detect it as a brand new OSD; the data on it would be lost and there would be a lot of rebalancing and stuff like that. But in this case, when the OSD comes back up, it's the exact same device running somewhere else, so we won't run into any of that. So I'm going to delete this 143 node. When I delete the node, the OSDs on it are going to migrate to a different node. Now, in this case, since we have two OSDs on that node, we did a really dirt-simple configuration, we don't have any anti-affinities on there, so there will be disruption to data readability, or data writability technically. Any volumes on those OSDs will go into read-only mode until the OSDs come back up. Can I just delete the node? Yeah, the machine. Oh, okay. Yeah, there we go. So we can see that the two OSDs went into a pending state, and they're coming up on the other node, that is triple-two. I'm willing to bet the ceph tools pod we were using was also on that node. Yeah, so two OSDs are down and here the two OSDs are in a pending state, and we'll have to wait for those two OSDs to come back up on the node. Check the pod status. Oh, the prepare is still running, okay. Oh, yeah. So in this case, because we're deleting a machine, OpenShift will also perform a drain of the node, which pushes all the pods off the node.
When the pod gets descheduled from the node, any PVC it was using gets detached from the node. Now, I specifically say the PVC gets detached from the node; what that means for the underlying storage system and the host itself depends on the storage driver managing that device. With something like a GP2 volume, as in this case, it is in fact getting detached from the host at runtime, and then the GP2 volume just kind of sits in the ether for a while until a pod claiming it gets scheduled onto a new node. Ah, and they're running. Yep, they're running and the host is back to okay. That's it, it worked. And since we are not using any placement anti-affinities, the pods came up on the same node, where otherwise they would have landed on different nodes in different AZs. Yep. So one of the things we recommend, for instance, if you're running this in AWS and have access to more than one AZ, is to run your workers so that you have at least one storage node per AZ. That way you can isolate failures if an AZ goes down. Obviously, if two of your AZs go down, well, then you're kind of screwed, but hopefully Amazon's service guarantees are such that you'll rarely have more than one go down at a time.

Can you speak up a little? What do we advise? So, in this case what we're doing is using remotely attached storage, so the storage lives separate from the actual host. In this case we were on AWS, so that was just standard GP2. So the machine is full? Well, the machines already exist, right? The AWS instances already exist, so all that happens is that Kubernetes detaches the GP2 volume from one node and then attaches it to the next node, which already exists. The AWS EBS volume provisioner comes into the picture here. And the question is more specifically about physical machines that you have in your lab, on bare metal? So for bare metal machines and physical volumes we use, what was the name of it? Local PVs. No, the local storage operator. The local storage operator creates PVs for those physical disks, and those PVs get referenced here. And that's how things work on bare metal.

So Kubernetes itself already has a notion of what it calls local PVs. Jan, keep me honest, did local PVs go GA, or are they still beta? Right, local PVs are now GA, generally available. What those are: it's still the same PV construct, but instead of defining storage against some external storage system, it defines storage directly attached to a node. So in our lovely cloud native case, AWS has a certain type of instance called i3en, or just i3, I forget, which is an instance that comes pre-attached with NVMe devices, and those can be presented as local volumes. So what you can do in raw Kubernetes is manually create a set of local PVs that reference those devices directly. When you create those PVs, you can set a storage class name on them that those PVs will be associated with. Then you create that storage class, but you set its provisioner to a special value, no-provisioner, meaning there is no storage back end that Kubernetes will engage with to provide that storage. So then what happens is you create the PVs, and the developer, as we like to say, creates a PVC as normal and references that static storage class.
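Here is a minimal sketch of that static setup, assuming a hypothetical node name and device path:

```yaml
# StorageClass with no provisioner: Kubernetes will only bind PVCs of this
# class to pre-created (static) PVs carrying the same class name.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-devices
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer   # enables topology-aware binding
---
# A manually created local PV pointing at a device on one specific node.
# The node name and device path are hypothetical.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-0
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  storageClassName: local-devices
  local:
    path: /dev/nvme1n1
  nodeAffinity:                 # restricts consumers to this node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-0
```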
And then Kubernetes will take care of finding an appropriate PV, in this case what's known as a static PV, since you created it manually, that meets the requirements of the PVC, and binds the static PV to the PVC. But it's also a local PV, so you now have a PVC referencing a local device, which means, together with another feature that just came in called topology-aware provisioning, that the PV/PVC pair can inform the Kubernetes scheduler to restrict where the pod consuming the PVC will be scheduled. So you can literally just have a pod that references a PVC with a local PV underneath it, and based on the restrictions of whatever PV it grabs, the pod will be scheduled, for instance, onto your AWS instance that has your NVMe devices. Now, I'm talking about AWS and cloud, et cetera, but this applies directly to bare metal environments. You would simply create your local PVs for any devices you want to consume as storage devices in Kubernetes, and then specify enough storage class device sets to consume all those devices.

A hand over there, right? Yeah, we can do that. So we're running a rook-ceph-tools pod, here I am inside that pod, and from here I can run all of the Ceph commands. For example, we can run ceph status. And if you want to remove an OSD, we can do this: we can mark an OSD out. So this one is marked out, and if you check the status, only two are in. So we can do all the operations we could normally do in any Ceph cluster. Yes. Yep. And then, oh, what was I gonna say? Hold on, can you say that again, louder? So, whether the Rook operator will spin up a new OSD if it detects that there are only two running? Oh no, in this case, no. If my understanding is correct, the OSD pods are still running, right? So from the Kubernetes API standpoint, the OSDs are still there, and Rook is fine. This is now internal to Ceph, which sits beneath the Kubernetes API, and thus the two don't necessarily interact. And if the Rook operator is down, it doesn't matter, because after the cluster is up, it's fully functional on its own. We only need the operator when we're upgrading the cluster or adding a new OSD. Otherwise, it's just there to keep track of things and whether they're working properly.

And one thing I just remembered I wanted to say: can you show the, what's it, the ceph osd tree, the hierarchy? Yep. The main thing I wanted to point out here is that the way we got this whole portable OSD thing to work is that we set the host here to the name of the PVC. Traditionally in Rook, this would be the same as a standard Ceph cluster, where host is the hostname of the machine the device is attached to. In this case, we set host to the name of the PVC representing the device, which is how we keep the OSD's identity. So when the OSD leaves the cluster, it's marked out, and when it rejoins the cluster, Ceph sees that the identities match, so it's marked in again. This is the case when we specify portable as true. When we set portable to false, the host name will be the name of the node where the OSD is running. The question, probably, was: can you have multiple devices across multiple nodes represented as a single device? Maybe? I have never heard a compelling use case for why you would do that.
In particular because, all right, you could argue that you could have this fancy distributed redundant device, but it would be doubly redundant, since Ceph already has built-in redundancy for your data. If you configure Ceph correctly, you don't need further redundancy in your devices. So you're saying you have Prometheus, so you have Prometheus data, and you're asking, would you want to be using a Ceph volume? Right, so in this case, you're using a Ceph volume. The Ceph volume is a separate entity from the storage devices, from the OSDs. A Ceph volume is part of a PG, right? Yeah, a Ceph volume is part of a PG. Is that like process group or something? Placement group, thank you. You can see how familiar I am with Ceph. It's part of a placement group, and that placement group maps to a set of OSDs, which defines which OSDs the data on that volume will be distributed to. And it all comes down to blocks, right? At the end of the day, your volume is just a chunk of blocks, and where the blocks get replicated is irrelevant from the perspective of the volume. So if your node goes down, Ceph just detects, oh, that client went away, the volume is no longer attached there. And then, in this case, you would have a Ceph CSI Kubernetes volume, a Kubernetes PV, that just gets attached to a new node, and the Ceph CSI storage driver in Kubernetes spins up a Ceph client on the host of the new machine, connects to the Ceph cluster, accesses the same volume, and the same data comes back up.

Yeah, I mean, the default CRUSH rules set the, what is it, no, it all comes down to how you specify your block pool. Rook has another entity called the CephBlockPool, which specifies, you know, I'm going to be using these OSDs and here is how I'm going to be distributing my data. Yep. I believe that if you have a really dirt-simple CephBlockPool, the default failure domain will be the host. So even if you have, say, three OSDs on the same host, if your failure domain is host, the CRUSH algorithm will only select one OSD from that host and find two OSDs on two other hosts. So the main takeaway here, yeah, I'm not sure either, but the main takeaway here is that there are a lot of things Ceph can do that can be relatively easily configured via Rook resources. What we just showed you here is a very low-level type of redundancy that is, in a way, orthogonal to the higher-level Ceph configuration you can do. The real trick is that, the way we wrote it, Ceph doesn't know what's going on, right? Ceph doesn't know, and more importantly, Ceph doesn't care. It doesn't know whether the devices are moving between hosts or not. All Ceph knows is: is it down? Yes. Oh, it's back up, good. And that's it. So from the standpoint of a traditional Ceph mindset, that host just went down and now it's back, the end.

I'll just show you the cluster spec here once. This is the cluster.yaml that comes with the Rook project. It is obviously a heavily detailed and heavily documented piece of YAML that tells you how to use each and every field in the CephCluster. Or ideally, each and every field we want you to be using in the CephCluster.
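For reference, a minimal sketch of the CephBlockPool mentioned a moment ago, with the failure domain set to host so each replica lands on a different host. The names and replica count are illustrative.

```yaml
# Minimal CephBlockPool sketch; names and replica count are illustrative.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host   # CRUSH picks at most one OSD per host for each replica set
  replicated:
    size: 3             # three copies of every block
```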
There is a, okay, right. How do we verify the cloud instances storing the data? That is beyond the scope of this presentation. We don't have that much Ceph knowledge. Yeah. So the question was basically, how can you guarantee that any particular cloud instance that's storing your data is storing it reliably, and what does Ceph do to mitigate that? And the answer is, that's beyond the scope of this presentation. We are not necessarily Ceph experts; we are cloud native developers pretending to know about Ceph, so take that with a grain of salt. All right, I'm being told we've got five minutes, so I'm going to call it there. Thank you everyone for joining us. Have a good rest of the day, and safe journeys home.