Okay, everyone, please welcome with me Rohan and Jose for some more Kubernetes fun.

All right, welcome everyone. We're here to discuss a relatively new feature that just made it into Rook: Rook can now use persistent volumes to access its storage, rather than going through local hosts directly in Kubernetes. A fair warning: this may contain opinions, speculation, and bad jokes that are entirely our fault. Please don't blame Red Hat, IBM, or the Rook project for anything we may or may not say.

All right, who are we? I'm Jose. I'm a senior software engineer at Red Hat, and I've been in and around storage for 10 years. I work on OpenShift Container Storage, or OCS, with a focus on Rook and Ceph. I'm a project lead for the OCS operator, which we will not be discussing in this talk; that comes later. And I tend to like hitting things, mostly drums. Over here we have Rohan Gupta. He graduated from college in 2018, so young. He also works on OCS with me. He likes watching anime and riding motorbikes.

All right, and here's where we're going to go. We're going to go through a bit of setup; I'm going to skip a couple of slides since most of you were already here for the Rook talk. Then we're going to go over how OSDs worked then and how they work now, and hopefully give a proper demo, network permitting.

All right, to start: storage in Kubernetes. Who here is familiar with general storage in Kubernetes, PVs, PVCs, and all that? All right, most of you, so this should be fairly easy. You have PVs, which are the representation of the actual storage volumes in Kubernetes. You have PVCs, which represent a user or developer request to use storage. And you have storage classes, which PVCs point to in order to create, get access to, and bind to PVs. Here's the general flow in a relatively small diagram; the slides will be available online if you want to look at it later. Basically, the user creates the persistent volume claim, the admin or operator creates the storage class, the PVC talks to the storage class, which talks to the storage backend and creates the persistent volume, which then gets mounted by the pod. (A minimal sketch of these objects follows below.)

All right, now the new stuff: raw block PVs. This just went beta in 1.16, I believe. It allows Kubernetes to present storage to containers without a formatted file system. So if you'd heard about block storage in Kubernetes before this, you weren't actually getting real block storage; what you were getting was a file system formatted onto a block device. But now, with raw block PVs, Kubernetes is able to present just a file (because, you know, it's Linux) that represents the block storage device, and that file can be used by the application running in the container. Many applications, like the databases MongoDB and Cassandra, are already capable of leveraging block devices for their storage with no additional configuration needed; some of them require additional configuration if you want to use a pre-formatted file system. And generally, avoiding the file system layer will tend to increase I/O performance and give you lower latency. So this was a much-asked-for feature in Kubernetes for a while.
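Here's that classic flow as a minimal sketch, file mode by default; the provisioner, names, and sizes are illustrative, not from the talk:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass               # created by the admin/operator
metadata:
  name: standard
provisioner: example.com/driver  # illustrative provisioner
---
apiVersion: v1
kind: PersistentVolumeClaim      # created by the user
metadata:
  name: my-data
spec:
  storageClassName: standard     # points the claim at the class
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:                # the bound PV's file system is mounted here
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-data
```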
And the way they implemented this, or the interface they implemented for doing this, is called the volume mode, which is a new field in both PVCs and PVs that specifies... aha, I was wrong. It went beta in Kubernetes 1.13. There we go.

So this basically takes one of two values and specifies how you want to access the volume: either Filesystem, which is the backwards-compatible default, or Block, which is how you activate the new feature. And this field must match in both the PVC and the PV; this is similar to any other required field in a PVC that must match a given PV in order to bind to it. Here you have the YAML, because, you know, lovely YAML, don't we all love it? You see down on the left, in regular Filesystem volume mode, you can just omit this field, because it defaults to Filesystem, and it would be the same in a PVC. And over on the right you see the normal way of specifying where a PVC gets mounted inside the container, using the volumeMounts field: you specify the name of the PVC and the mountPath where it appears, and that's where the file system gets mounted.

Volume mode Block, on the other hand: you can't omit the volume mode, obviously. Whether you can use Filesystem or Block is determined on a case-by-case basis for each storage provider in Kubernetes. So even if you choose a storage provider that somehow only supports volume mode Block (those may come), you still need to specify volume mode Block, because it will default to Filesystem, and if your storage provider doesn't support Filesystem, your PVC and PV won't bind. The thing to note is that there is now also a new field in the pod spec called volumeDevices. This is where you again specify the PVC name, but instead you specify a devicePath, which is the name of the file that will represent the storage device. (A sketch of the Block side follows below.)

Now, a quick note here: volume mode and access mode are not the same thing, nor are they related. Access modes, as you may be aware, specify how pods may interact with a given PVC. RWX stands for ReadWriteMany, meaning that multiple pods can read and write to the same PVC. RWO stands for ReadWriteOnce, which means that only one pod can mount the PVC at a given time. And support for access modes versus volume modes is, again, determined on a case-by-case basis for each storage provider. So just because a storage provider supports RWX volumes in Filesystem mode does not necessarily mean it will support RWX volumes in Block mode. You have to know a little bit about which storage provider you're using to make sure that all syncs up.

All right, here are the slides I'm going to skip, since most of you were already here: Rook and stuff. What is Rook?
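The slide YAML isn't reproduced here, so here's a sketch of the Block side; the storage class name and device path are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Block           # cannot be omitted; defaults to Filesystem
  storageClassName: fast-rbd  # illustrative storage class name
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: block-consumer
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    volumeDevices:            # volumeDevices instead of volumeMounts
    - name: data
      devicePath: /dev/xvda   # the file that represents the storage device
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: block-pvc
```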
Rook is a set of operators following the operator pattern. We're specifically interested in the Rook Ceph operator, since this is the first one to make use of this new PVC feature. The main thing I want to go over here is that all the Ceph daemons are encapsulated in their own pods, and they run across various nodes in Kubernetes. The daemons we're interested in are the OSD daemons, which are the ones that actually represent the underlying storage, and in fact are the pods that bind to the block devices.

Right, OSDs. Typically, local-storage OSDs are privileged pods that have access to the full /dev directory of the host they run on. The way you set this up is that you define your storage nodes, either by name or by labels, or you can use all nodes, as is specified here. Then you have to specify your local devices, and that can either be done by hand in the cluster YAML, or you can just say "use all devices" and make use of Rook's auto-discovery daemon. (An abbreviated example follows below.) Then Rook takes care of the rest: it has an OSD prepare job that goes out, finds the devices, formats them appropriately to be used by the OSDs, and then assigns given devices to particular OSDs.

The advantage of this is that the base case is really easy to configure. And for many people coming into this, storage admins in particular, or in fact just anyone who's ever played with a Linux system, this is very familiar, because you're just directly accessing a given device at a given name on a given host. And it supports any type of device or appliance that Linux supports, because all we care about is a file; as long as there's a file representing that given storage volume, we can make use of it.

The downside, at least in my opinion, is that this relies on specialized storage nodes. So if you have a Kubernetes cluster, you need at least some nodes that have local devices attached to them, which may not be your standard compute nodes in the cluster. And in this case there's a rigid coupling between your compute and your storage, meaning that if a given node goes down, it brings down not only that node but the storage that goes along with it.
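An abbreviated sketch of that local-device configuration in the CephCluster resource, with field names as in Rook v1.x and node/device names illustrative:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # mon, cephVersion, dataDirHostPath, etc. omitted for brevity
  storage:
    useAllNodes: true      # or select specific storage nodes by name/label
    useAllDevices: true    # let the auto-discovery daemon find devices...
    # ...or list nodes and devices by hand instead:
    # nodes:
    # - name: node-a
    #   devices:
    #   - name: "sdb"
```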
So we implemented something called storage class device sets. It's a bit of a mouthful, I know. So once again, you define your storage nodes in your Rook Ceph cluster, by names or labels or all of them. But then you don't define your individual disks; you define a desired amount of storage. And then Rook takes care of the same automation as before: preparing the OSDs and starting the OSD pods.

So what does this mean? Real quick, here's what a storage class device set looks like. It was designed to be a generic Rook struct, but currently only Rook Ceph is using it. You have to specify a name, to give your set a unique identity that can map to the PVCs it generates. What this is going to do is dynamically provision a single PVC for each OSD pod. So now you're not going out and finding devices to attach to your OSDs; you're defining your OSDs and creating devices for them. Then count is the number of devices you want in this set: say you want 100-gigabyte devices and you want five of them in this particular set, you specify five there. We have a portable field that defines whether PVCs are allowed to move between nodes or not; again, this depends on your underlying storage provider, since some of them allow PVCs to move between nodes and some don't. And then you have volumeClaimTemplates, which is a list of PVC specs; that's all it is. Currently only one is supported for Rook; we're looking to enable a couple more in the future, to make use of more advanced features in Rook Ceph. And if you're at all familiar with the PVC spec, it's basically just anything that can go into that spec. (An example device set follows below.)

Pros, as far as I'm concerned: it offloads device distribution. If you wanted robust storage, you would generally have to make sure that your devices were properly distributed across nodes, across zones, across data centers. Whereas in this case you can programmatically define what that distribution looks like, and the devices will be provisioned at that distribution automatically. And, presuming your underlying storage allows for it, you can migrate devices between nodes. It works with any raw block PV regardless of driver, so if Kubernetes supports it, you can use it. And it's shiny and new; some people like new features just for being different.

On the downside, this requires predefined storage classes. So instead of your previously defined devices, you need a previously defined storage class, which can only be created by a Kubernetes cluster administrator. So if somehow you're running Rook on someone else's cluster, you'll need the admin to create the storage class for you. And while I said that anything that's supported in Kubernetes is acceptable, this is also limited to only what Kubernetes accepts; there must be a storage driver already existing in Kubernetes for you to make use of this. It's also not as simple to configure: the base case is inherently more complicated to set up than the base case for how OSDs used to be defined. And it's new and different; some people are suspicious of new features, especially when they haven't been tested in the field yet.
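Here's a sketch of a device set as just described, matching the Rook v1.x CephCluster fields; the set name and storage class are illustrative:

```yaml
spec:
  storage:
    storageClassDeviceSets:
    - name: set1              # unique identity, mapped to the generated PVCs
      count: 5                # five devices in this set
      portable: true          # PVCs may move between nodes, if the provider allows it
      volumeClaimTemplates:   # a list of PVC specs; only one supported today
      - spec:
          storageClassName: gp2   # a predefined storage class (illustrative)
          accessModes:
          - ReadWriteOnce
          volumeMode: Block
          resources:
            requests:
              storage: 100Gi      # each OSD gets a 100 GiB device
```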
All right, now we're going to go into some bumps in the road that we ran into along the way in trying to implement this. As developers, when we write code, we think that when we hit build, it should just build and work out of the box. But that's never the case, right? So these are some of the issues we faced while writing this feature.

OSD pods run as privileged pods, and when Kubernetes presents a block device to a privileged pod, it's bind-mounted and presented in a different way. So we don't get our block devices in the privileged containers, and we really needed our block devices in the privileged containers, because we were using LVM and LVM needed that. So we figured out a workaround: we had a common shared directory between an init container and the OSD container, and we attached the block device to the init container, which was not privileged, while the OSD container was running as a privileged container. And since the shared directory was there in the init container as well as in the privileged container, we found a way to copy the block device there to make it work. I don't know if it's visible at the back, but here on the right-hand side there's an emptyDir, the ceph-1-dev-bridge directory, and we have a block volume, the ceph-1/dev-0 device, mounted on the init container. Here on the left-hand side, we have the main OSD pod, which has only the bridge directory. What the init container is doing is copying the ceph-1/dev-0 device into the shared directory, and the shared directory is here, so we get the block device on the main OSD container. As we know, in Linux everything is presented as a file, so we're just copying a file in the init container and getting it in the main OSD container. (A sketch of this pod layout follows below.)

The next issue we faced was when we were spinning up multiple OSDs on the same node. We had /dev mounted into the container, and the block device was also presented to the pod as a loopback device. So what was happening was that LVM was picking up the device from /dev as well as the loopback device, and it was getting confused about which device to pick; the Ceph OSD start command was the main thing that was confused by this. The solution we found was to use the complete name of the LV: /dev, then the name of the VG, then the name of the LV.

The other issue we faced was proper distribution. Suppose we have six OSDs on three nodes, and all the OSDs come up on the same node. That's really not desired, because we keep replicas of the data across OSDs: suppose the replica count is three and all three replicas land on the same node; if that node goes down, we lose all the devices, which is really bad for us. To solve this, we used placement anti-affinities.
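A hypothetical, heavily abbreviated sketch of that bridge workaround; container names, image tag, paths, and the PVC name are assumptions for illustration, not Rook's exact manifests:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: osd-example
spec:
  initContainers:
  - name: blkdevmapper             # hypothetical name
    image: rook/ceph:v1.1.0        # hypothetical tag
    # copy the device node into the shared "bridge" directory
    command: ["cp", "-a", "/ceph-1/dev-0", "/ceph-1-dev-bridge/dev-0"]
    volumeDevices:                 # raw block device, presented as a file
    - name: ceph-1-dev-0
      devicePath: /ceph-1/dev-0
    volumeMounts:
    - name: bridge
      mountPath: /ceph-1-dev-bridge
  containers:
  - name: osd
    image: rook/ceph:v1.1.0
    securityContext:
      privileged: true             # the OSD container runs privileged
    volumeMounts:
    - name: bridge                 # the copied device node is visible here
      mountPath: /ceph-1-dev-bridge
  volumes:
  - name: bridge
    emptyDir: {}                   # the shared directory between containers
  - name: ceph-1-dev-0
    persistentVolumeClaim:
      claimName: set1-0-data       # hypothetical PVC from the device set
```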
Now, the most interesting part: the demo. And we have a pic of a cute cat that's praying. Hope this works.

Okay, so we have three OSDs running here: one is on the node ending in 247, one on 233, and one on 156. And if you look at the status of Ceph, it's healthy. We're using an OpenShift cluster because our internal automation makes it easier for us to spin up clusters; so we're using Red Hat, because, well, Red Hat. We have four worker nodes here, three of which are our storage nodes. What we're going to do is delete one of the nodes which has an OSD running on it, and that OSD should migrate to another node.

So I'm going to delete this 156 machine, and hopefully the OSDs migrate to a different node. And for 156... yep, here is the machine for that. Okay, we deleted the machine, and one of the OSDs went into the Pending state because it's looking for a node, and we see there's another pod also in a Pending state because it's looking for a new node to jump to. And the toolbox did... there we go. Yeah, so one of the mons is down, because it was on the node which I deleted, and we have three OSDs, two up and three in. So the one OSD that's in the Pending state is the one which was lost because of the machine that was deleted. Let's wait for it to come back on another node. Let's do a get pods in the storage namespace... oh, it's already running. Yep, so the OSD is back, and in some time the health will be OK. There it is: still in warning because of the mon being down, yep, but the OSD pod came back up. So yes, at some point the Ceph cluster will reconcile, but our feature succeeded. All right. Thank you.

But wait, there's more. But wait, there's more!

You may have noticed that we omitted one key thing that some of you may be interested in: what about on premises? I kept talking about OSD portability and migrating between nodes and all that, which is all great for cloud environments, and the demo was on AWS. So it sounds like we're leaving traditional on-prem storage in the lurch. Not the case. Because also in Kubernetes 1.14 we have the notion of local PVs, which, when combined with block PVs, gives you, obviously, local block PVs. This allows you to access local volumes via the PVC/PV interface, rather than specifying a direct host path on a given node in the pod spec. So this is trying to decouple the specification of local storage from the pod into a separate entity.

The way this works is that you create a local PV with a storage class name, and you have to specify a node affinity. You see there in bold, there's the storage class name; above it is volume mode Block; and then a node affinity that specifies what node this device is on. Then what you have to do is create a storage class using no provisioner, because the device already exists, so there's nothing to provision in that case. And then you specify something called volume binding mode equal to WaitForFirstConsumer, which enables a new feature called topology-aware provisioning. What this does is allow the Kubernetes pod scheduler to take into account the locality of the PV the pod is binding to. So when the scheduler is calculating what nodes the pod can be scheduled on, it will check the PV spec to see if it has any locality information, like a node affinity, and feed that into its algorithm to get the list of valid nodes. And then after that, you just create your PVC and pod spec as normal. So in this case you would just create another PVC that specifies... oh yeah, it's right there... that specifies the local-storage storage class. I made these slides like 30 minutes ago, so I guess I forgot to specify the storage class. And then you make sure the parameters match the PV you want: ReadWriteOnce, Block mode, same storage size. And that's it. (A sketch of these objects follows below.)
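A minimal sketch of the local block PV setup just described; the node name, device path, and size are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-block-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  volumeMode: Block                # raw block, no file system
  storageClassName: local-storage
  local:
    path: /dev/sdb                 # the device on the host
  nodeAffinity:                    # required for local PVs
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-a                 # the node this device is on
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner  # device already exists
volumeBindingMode: WaitForFirstConsumer    # topology-aware scheduling
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-block-pvc
spec:
  storageClassName: local-storage
  accessModes:
  - ReadWriteOnce                  # parameters must match the PV
  volumeMode: Block
  resources:
    requests:
      storage: 100Gi
```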
So there's a lot of work on the back end for the administrator to take care of, because the administrator is the one that has to create the local PVs and the storage class associated with them. But from the developer's point of view, and don't we all just love developers and making their lives easier, it's no different than using any other storage.

All right, thanks again. Questions? Question time, anyone? We got one.

[Question from the audience.] What about Rook providing block mode PVs from, say, RBD in Ceph? That is entirely dependent on the storage driver itself, not the operator, and I'm happy to say it's supported in Ceph CSI. So Ceph CSI RBD will support block mode PVs.

I'm going to say either I was not very interesting or I described myself very well. So... oh, here we go.

[Question from the audience.] The duplicate-PV issue in LVM: any reason why we didn't solve that in this feature? Quite frankly, because I'm not aware of it, so I can look into that and see what we can do about it. [Audience: you can just tell LVM to ignore certain devices.] We didn't do that. Right, so apparently you can just use a regex to have LVM ignore certain devices, but we didn't do that. (A sketch of that filter follows below.) I'm wondering if that wasn't because we couldn't guarantee the name of the device. No, we had the name. Okay, so yeah, maybe something to look into. All right. All right, we good? Thank you.
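The LVM device filter mentioned in that last answer lives in /etc/lvm/lvm.conf; a sketch, assuming you want LVM to scan only device-mapper paths, and not what Rook actually shipped:

```
devices {
    # Accept device-mapper paths, reject everything else, so LVM does not
    # also scan the loopback duplicate of the same physical volume.
    filter = [ "a|^/dev/mapper/.*|", "r|.*|" ]
}
```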