Hello everybody. Welcome to my talk about the Container Storage Interface in Kubernetes. My name is Honza Szafranek. I've been working for Red Hat on Kubernetes for almost four years, and I have always worked on storage in Kubernetes. Before I start, I would like to know who you are, so I can maybe tailor the presentation a bit. Could you please raise your hand if you have ever created a persistent volume claim in Kubernetes or OpenShift? I expected more. And who administrates a Kubernetes or OpenShift cluster, either locally for development or in production, for testing, whatever? Not so many. Okay.

So I will start with the Container Storage Interface. What is it? In a nutshell, it is an interface between a container orchestrator, meaning Kubernetes, Docker Swarm, Cloud Foundry, Mesos, and these guys, and storage providers. Storage provider here is a pretty generic term. It could be traditional storage like iSCSI, NFS, or Samba. It could be software-defined storage like Gluster, Ceph, and many new startups. Or it could be cloud storage like OpenStack Cinder, Amazon EBS, Google PD, and this stuff. The standard focuses on workloads that run as containers. That means the guy who consumes the storage runs in a container. It is not limited to containers; you can use CSI without any containers at all, but some of the concepts come from the container world.

In Kubernetes, I have good news for the users here: there are no changes for you. You will still use persistent volume claims as you are used to, you will use storage classes, everything will work. I have bad news for admins, because the things that we traditionally ran inside Kubernetes, inside the controller manager, inside the kubelet, we now run outside, and you will see new deployments, new pods in your cluster. You need to maintain them, manage them, and you need to know what they mean. The ultimate goal for Kubernetes is to remove all the storage code from the Kubernetes code base and run everything as CSI. We have just started. It will take a long time, perhaps several years, until we are there, but you should be prepared that all the storage will be CSI, and as cluster admins, you need to care about that.

So, how did CSI start? Once upon a time in Kubernetes, we started writing volume plugins. A volume plugin in Kubernetes is a piece of code that talks to storage. Over time, we accumulated 22 persistent storage volume plugins, and every now and then somebody came and wanted their own storage plugin in Kubernetes. It quickly became obvious that we can't maintain all of them, because we call them plugins, but they are not real plugins. You can't dynamically load a new storage plugin; all of them are hard-coded in the Kubernetes code base. You need to follow Kubernetes coding standards, you need to send pull requests to Kubernetes, and you need to follow the Kubernetes release schedule if you want to fix a bug or add a new volume plugin. So we invented Flex, a volume plugin which allowed storage vendors to install drivers on the host, on all nodes in the cluster, and Kubernetes would call this binary when it needed to talk to storage, and the binary would do the work. The problem with this approach is that installing binaries on the host is not that easy. Nowadays people usually distribute software as containers, especially in Kubernetes; they don't want to install binaries on the host, and some hosts like CoreOS or Atomic Host don't allow you to install anything on the host at all.
We were thinking about how to improve this situation, and we started implementing Flex 2.0, which was gRPC-based. You could run the storage driver in a container, and it would be much better. Then somebody had a smart idea: why don't we talk to Docker Swarm? Why don't we talk to Cloud Foundry, Mesos, and the other people who run containers? They must have the same problems we have. And that's how CSI was born.

CSI, at a high level, is a project on GitHub. You can join the project, you can download the specification, you can implement it. There are no fees, no patents, nothing. You can create issues on GitHub, you can send pull requests if you want to extend the specification. There are no barriers. There are weekly meetings; I don't join them because they became slightly political. Every storage vendor has its own cool feature that nobody else wants, that is super important to them and that nobody else cares about, and they want to push it into CSI. Also, storage in Kubernetes, Mesos, Cloud Foundry and Docker Swarm looks completely different. In December last year, we released version 1.0, which is declared stable, and we will maintain compatibility with this version. If you run an older Kubernetes, it implements a pre-1.0 version of the spec, and that's not compatible with 1.0. The incompatibilities are not huge; we usually renamed some fields here and there to make it more obvious what they mean. But still, if you have older drivers, you need to adapt them to version 1.0.

On a technical level, it's a gRPC API between the container orchestrator, which for the purpose of this talk will be Kubernetes, and the server, the guy who does all the work, which is the storage plugin. "Storage plugin" is the term used in the CSI specification, but since in Kubernetes we use the term plugin for something else, we call them CSI drivers. The communication between them usually goes over a Unix domain socket, and as the name gRPC implies, it's a remote procedure call: Kubernetes just calls functions of the driver, the driver does something useful and returns a value. The CSI standard doesn't define anything about how to distribute the plugin or what the plugin is, whether it is an installer or an executable or an RPM package. In Kubernetes, we expect that these are containers.

There is a growing community of storage vendors who have started implementing drivers. In Kubernetes CSI, we maintain a list of the drivers we know about. If you have a driver that's not listed there, send us a pull request; there is no entry barrier. We don't check the quality, we don't check the support, and actually the quality of the drivers varies a lot. I would say most of them are alpha or beta quality; I wouldn't run them in production. But if you are a storage vendor, now is the right time to start testing your driver.

So back to Kubernetes. How did we implement CSI in Kubernetes? We had 22 volume plugins; now we have 23, and one of them is called CSI. We started with alpha in 1.8, I think. It was enabled by default in Kubernetes 1.10, and it was declared stable in Kubernetes 1.13, released last December. If you use OpenShift, you need to wait, because it won't be in OpenShift 4.0; you need to wait for 4.1, maybe 4.2. And while the implementation is declared stable in Kubernetes, we don't have all the features: we don't have block volumes yet, and we don't have volume resize yet.
If you used the alpha implementation of snapshots in Kubernetes: we completely rewrote that, and we have another alpha implementation of snapshots in Kubernetes, which is actually much better. And if you want to use CSI, you must use it through persistent volumes and persistent volume claims; you can't use CSI volumes inline in pods yet.

For the users, I have a very short demo. I have a cluster running on Amazon. It's OpenShift 3.11, and this OpenShift doesn't know anything about Amazon; it doesn't have credentials to Amazon. But it runs a CSI driver, and the CSI driver is the guy who knows about Amazon. So I have a claim. It's the claim you are used to: it requests 500 megabytes of some storage. It doesn't say which driver or what kind of storage; that doesn't matter for the users. I create the claim, and I can see that Kubernetes, in the end, talked to the CSI driver, and a new volume was created just for my claim. If I look at the persistent volume, it's a pretty standard persistent volume, except that the volume source is CSI. The driver that's responsible for handling this volume is something that calls itself ebs.csi.aws.com. Another important field is the volume handle, which is the ID of the volume. The ID is interpreted only by the driver; Kubernetes treats it as an opaque string and passes it around. And then I just create a pod. The pod is some busybox that does nothing useful, but it uses the volume through the persistent volume claim. There is nothing special in the pod, and the pod is running. So if you are a user, nothing changes for you, business as usual.

If you are a cluster administrator, life becomes much harder for you. Yeah, you're welcome. Why is it so complicated? Because the parts that were traditionally running inside Kubernetes are now outside: on the master, we had volume plugins that did the dynamic provisioning and the third-party attachment, and on the kubelet, we had code in Kubernetes that found the block device, created a file system, and mounted it. These parts are now outside of Kubernetes.

So to install a CSI driver, the best thing I can recommend is to go to your storage vendor and ask how to install it. The storage vendor knows, or should know, what daemon sets, stateful sets, service accounts, secrets and RBAC rules you need to install their driver. But since CSI is at quite an early stage, the patterns for how to install a CSI driver are still emerging, and some vendors get it wrong in this initial phase. It will get better, don't worry, but some of them are pretty stubborn. I don't have a demo of an installation, but I can show you how the Amazon CSI driver installation looks. There is a readme somewhere that tells you: fill this secret with your Amazon credentials, then create these objects. And that's it, that's the whole installation. You create these objects and it creates a bunch of service accounts, cluster roles, stateful sets, another service account, a daemon set, and a whole zoo of objects that are necessary to run a CSI driver.

I will go through the components that are necessary to run a CSI driver, that must be in every CSI driver installation running in your cluster, so you know they exist and especially why they are there. The first thing you need to run a CSI driver is the external provisioner. Traditionally, dynamic provisioning of volumes was done by the master, which talked directly to the storage API, for example the Amazon API or a storage cluster.
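For reference, the user-facing objects involved in dynamic provisioning might look roughly like this. This is only an illustrative sketch: the storage class name and parameters are made up, and the driver name is the one from the demo.

```yaml
# Illustrative sketch: a storage class backed by a CSI driver, and a claim using it.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-ebs                    # hypothetical name
provisioner: ebs.csi.aws.com       # which CSI driver provisions volumes for this class
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
spec:
  storageClassName: csi-ebs
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi               # the 500 megabytes from the demo
```

The provisioner field is what ties a storage class to one particular CSI driver.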
Now the master does not know anything about storage, so you must run this piece of the controller outside of the master, and this piece does the external provisioning. Each CSI driver needs its own instance of the external provisioner, because they can have different timeouts and different parameters; it's better to run them separately. What the provisioner does is watch the API server for persistent volume claims. If it sees a persistent volume claim that needs provisioning, and it's this provisioner who should provision it, it calls the driver that's running in the same pod as the provisioner, using gRPC over the Unix domain socket. The CSI driver does whatever is necessary to create a volume, and in the end the provisioner creates a persistent volume. You don't need to know the details, you don't need to know the exact flow, but you should know why the provisioner is there and why it's necessary, so that if your dynamic provisioning doesn't work, you know where to look. So if something doesn't work, which is pretty often at this early stage, or if you are developing a CSI driver and something doesn't work, the first thing you should do is describe your persistent volume claim, because that's where the external provisioner sends events. In this case, I can see that the external provisioner started provisioning and that it successfully provisioned something. If the provisioning went wrong, you will see errors here in the PVC. If it doesn't show anything useful, then you go to the logs of the provisioner and your driver, and so on. This should not happen if the driver is mature enough.

The second thing we need to run in the cluster is the external attacher. Again, before, this was implemented in the controller manager on the master, which went directly to the storage API and did the third-party attach of volumes. Now the master doesn't know anything about storage, so there is an external attacher that does that piece of the logic. Only a few volume types need third-party attach, usually in the cloud, where you have an API where you can tell Amazon: hey, I have a node and I have a volume, please attach the volume to that node. The guy who asks is not the node; Amazon goes and attaches the volume to the node, and a block device with the volume magically appears inside the node. But if you run, I don't know, NFS or iSCSI, Fibre Channel, Gluster, Ceph, you don't have an API that remotely attaches volumes to nodes, but you still need to run the attacher. That's the most important thing. At least half of the issues we get on GitHub or Slack are "my CSI driver is not working," and the first question we ask is: do you run the attacher? "No, my volume doesn't need the attacher, I have simple NFS." You still need to run the attacher, because, as I've said, the master doesn't know anything about the storage; it doesn't know that there is no need for an attacher. We are working on a fix, but right now you need to run the attacher. Again, how it works is pretty complicated; you don't need to know the details, but if something goes wrong, you go to the CSI attacher logs, and if there is nothing there, you go to the controller manager on your master.

Sorry, repeating the question: whether the attacher is the thing that does the mount on each host and runs on the node. No, the attacher only does the third-party attach, and only in the clouds. The component that does the mount, I will have a slide about that later. Back to the attacher. How it works: there is a new API object called VolumeAttachment. The controller inside Kubernetes doesn't know anything about storage, and the CSI volume plugin has no interface to the CSI driver, intentionally, because on some Kubernetes deployments you can't install the driver on the master; you must run it somewhere else. So the volume plugin just creates a VolumeAttachment instance. The instance says: please, somebody out there, attach this volume to that node. The external attacher running somewhere in the cluster sees that there is a new VolumeAttachment object, checks it (hey, I should do something, I should attach this volume), talks to the CSI driver over the domain socket, and the CSI driver does the attach.
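Concretely, the attacher watches for VolumeAttachment objects like the one sketched below. This is only an illustration: the names are made up, and in practice these objects are created by Kubernetes, not by hand.

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-4d7f2c0a                   # generated name (illustrative)
spec:
  attacher: ebs.csi.aws.com            # who should do the attach (the CSI driver)
  source:
    persistentVolumeName: pvc-1234     # what should be attached
  nodeName: ip-10-0-0-1                # where it should be attached
status:
  attached: true                       # filled in by the external attacher
```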
So in the demo, I have a pod running and I have a VolumeAttachment for it. The spec part of the VolumeAttachment says who should do the attach (the CSI driver), what should be attached (this volume), and where (this node). The attacher does that and fills in the status. In this case it was successful: the volume is attached, no errors. If there were any errors, you would see messages here. The second thing you can debug is the pod. Here you can see that the attachment succeeded, but if it had not succeeded, you would see events here telling you why and what went wrong. So again, even if your volume kind is not attachable and doesn't use third-party attach, you still need to run the attacher.

And the last component we have is called the node driver registrar. That's the thing that runs on every node, and it registers the CSI driver with the kubelet. I have a picture here; it's taken from the OpenShift documentation, because I'm bad at drawing pictures. On each node you have a pod that runs the driver, and that's where the mount happens. There is also a driver registrar that registers the driver with the OpenShift node service, which is a fancy name for the kubelet. The kubelet then talks directly to the driver over the domain socket, and it's the driver that does the mount on the node; whether it's iSCSI or NFS, basically everything needs to be mounted. So to have CSI, you run a daemon set that installs the driver registrar and the driver on each node, and somewhere in the cluster you need to run a pod with the external attacher and external provisioner, which talk to their own copy of the driver. Are there any questions? Because that's the complexity: you need to run external components, you need to know what they do, why they are there, and, if something happens, where to look.

So the question is whether, in the case of Amazon, the volume is directly attached to the host, or whether there is something in the driver that translates it somehow to the pod. And the answer is no, CSI does not stand in the data path. This is a single-node cluster, and what the external attacher does is attach the volume directly to the node. So I can see xvdbx directly available as a device on the node, and the driver on the node then mounts xvdbx directly, wherever Kubernetes needs it. CSI stays outside of the data path; it doesn't translate the data in any way. It just mounts the thing into the container, and the container sees it as a normal mount.
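To make the control-plane side concrete, here is a heavily trimmed, illustrative sketch of such a controller pod: the external provisioner and external attacher sidecars next to a copy of the driver, all talking over a shared Unix socket. The image names, driver name, service account and flags are examples only; use the manifests from your storage vendor.

```yaml
apiVersion: apps/v1
kind: StatefulSet                  # or a Deployment with leader election, discussed below
metadata:
  name: example-csi-controller
spec:
  serviceName: example-csi-controller
  replicas: 1
  selector:
    matchLabels:
      app: example-csi-controller
  template:
    metadata:
      labels:
        app: example-csi-controller
    spec:
      serviceAccountName: example-csi-controller-sa   # needs RBAC for PVs, PVCs, VolumeAttachments
      containers:
        - name: csi-provisioner                # watches claims, creates persistent volumes
          image: quay.io/k8scsi/csi-provisioner:v1.0.1   # version must match your Kubernetes
          args: ["--csi-address=/csi/csi.sock"]
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
        - name: csi-attacher                   # watches VolumeAttachments, asks the driver to attach
          image: quay.io/k8scsi/csi-attacher:v1.0.1
          args: ["--csi-address=/csi/csi.sock"]
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
        - name: csi-driver                     # the vendor's driver, controller part
          image: example.com/example-csi-driver:v1.0.0
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
      volumes:
        - name: socket-dir
          emptyDir: {}
```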
Any other questions? There? The question is whether there is some way to limit IOPS, because in Kubernetes there is no support for limiting IOPS. And the answer is no, not yet. Patches are welcome. There is no global option. I know that in the case of Amazon you can request a volume with specific IOPS, but that's an implementation detail of the driver; there is no global option for all drivers, nothing like that. Any other questions?

A question about node failures? Well, I'm not sure what you are asking about. If the node fails, then why would you care about storage? The storage is gone. If a node fails, then eventually all the pods on that node are evicted, and since we can use third-party attach and detach, we can detach the volumes from the node. Then something in the cluster, a deployment, a stateful set, whatever, starts new pods, the pods are scheduled to some other node that's running, and the data will be mounted on that node. There is no difference from how Kubernetes works today without CSI; it was the same.

And there was another question. The question was: what if I have more CSI drivers, like NFS, iSCSI, whatever? The answer is that you can run as many drivers as you want. Each persistent volume in the cluster has a field that says which driver handles this volume, so Kubernetes knows whom to call. And dynamic provisioning happens based on the storage class, and each storage class has one CSI driver behind it.

Okay, I collected some mistakes that people usually make when they install CSI drivers. Maybe they are not mistakes per se, but I think there are better ways to do it. As I have said, you must run the external attacher and external provisioner somewhere in the cluster, in a pod, and they should run just once: one provisioner running at a time and one attacher running at a time, per driver. One driver needs one attacher and one provisioner. People use a StatefulSet for that, because it guarantees that only one replica runs at a time; if you use a Deployment with one replica, it can happen in some corner cases, during updates, that two replicas are running at the same time. That's true, but if your replica is on a node that goes down, it may take some time until Kubernetes discovers that the node is unhealthy and moves the pod somewhere else; it can take up to five minutes. And during those five minutes, you are without dynamic provisioning, which is probably okay, but you are also without the attacher: you can't attach volumes, you can't start new pods. In this case, I recommend using a Deployment with multiple replicas and leader election. All the components we have, the provisioner and the attacher, both have leader election, so they make sure that only one replica is active, and if one replica dies because it's on an unavailable node, a new leader can be elected in a few seconds.

Another thing I noticed, especially in the OpenShift case: the kubelet uses the CSI driver on the node to mount volumes, and it mounts volumes under /var/lib/kubelet. That means the driver pod needs to have mount propagation and everything set up for /var/lib/kubelet. But if you run OpenShift, we don't use /var/lib/kubelet, we use /var/lib/origin. So if you use a non-standard Kubernetes distribution, like OpenShift, beware: not all CSI driver vendors know about that. In OpenShift, after a while, we will have /var/lib/kubelet too.
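For the node side, the typical pattern looks roughly like the sketch below: a DaemonSet with the vendor's driver container plus the node-driver-registrar sidecar, sharing a socket on the host and mounting the kubelet directory with mount propagation. The image names, driver name and exact paths are illustrative; follow your vendor's manifests and adjust the paths for your distribution.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-csi-node
spec:
  selector:
    matchLabels:
      app: example-csi-node
  template:
    metadata:
      labels:
        app: example-csi-node
    spec:
      containers:
        - name: csi-driver                        # the vendor's driver; this is what does the mounts
          image: example.com/example-csi-driver:v1.0.0
          securityContext:
            privileged: true                      # usually required to mount on the host
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
            - name: kubelet-dir
              mountPath: /var/lib/kubelet         # /var/lib/origin on older OpenShift
              mountPropagation: Bidirectional     # so mounts become visible to the kubelet
        - name: node-driver-registrar             # registers the driver with the kubelet
          image: quay.io/k8scsi/csi-node-driver-registrar:v1.0.1
          args:
            - --csi-address=/csi/csi.sock
            - --kubelet-registration-path=/var/lib/kubelet/plugins/example.csi.driver/csi.sock
          volumeMounts:
            - name: socket-dir
              mountPath: /csi
            - name: registration-dir
              mountPath: /registration
      volumes:
        - name: socket-dir
          hostPath:
            path: /var/lib/kubelet/plugins/example.csi.driver
            type: DirectoryOrCreate
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins_registry
            type: Directory
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
```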
Finally, these sidecar containers, like the external attacher, external provisioner and driver registrar, are tightly bound to the Kubernetes version. So if you update your Kubernetes, you must also update these sidecar containers. There is a table in our documentation (it was here yesterday; here it is) that says which version of Kubernetes needs which version of... no, this one is about the CSI spec, not the sidecars, sorry. Somebody merged a bit of documentation, which is good, actually. The CSI spec itself changed a bit, unfortunately in an incompatible way. So if you have a driver that implements the older version of the spec, you need Kubernetes 1.12; if you have the newer version, you need Kubernetes 1.13. And the most important one: some people don't run the attacher and are then surprised. Please run the external attacher, please. Also, since you now have pods in your cluster that take care of storage, you must make sure they are always running, so you must give these pods a proper priority. If somebody schedules an important workload, it could happen that a pod without priority gets killed, and you don't want that to happen to your storage. So please run these pods with a high enough priority so they are not killed accidentally.

What's the current status in Kubernetes? The deployment is not trivial. We are working on that, and maybe it will become easier, but in the end it depends on the storage vendor, because the storage vendor knows what it means to install their driver, and creating some Kubernetes objects is just the last step. Usually you need to prepare the host, open firewall ports, load kernel modules, whatever; each storage is different. So you should go to your storage vendor. The community around CSI is very active. It may not look like it, but we are moving forward quite quickly: you have seen beta in 1.10 and stable in 1.13, that's nine months, which is pretty quick in Kubernetes. And we know that our documentation sucks. We are working on it; there is a link to a pull request that improves the documentation. If you want to join the community and contribute something, either on the Kubernetes side or the driver side, this is an excellent starting point, because you can review the documentation, fix typos, fix the grammar, and if you don't understand something, you can fix that too. While reviewing the documentation, you will learn a lot about CSI and how it works.

In Kubernetes 1.13, we introduced alpha support for volume topology. That means that if your cluster is not uniform (for example, in Amazon or another cloud you have availability zones and regions, and some volumes are available only in some regions or availability zones), we now have a way to express that in CSI. Each volume can be in a different region and attachable only to certain nodes, and Kubernetes now knows about that. So if you have a volume in zone, I don't know, us-east-1, then Kubernetes will go and schedule the pod in us-east-1, where the volume is accessible. And we are trying to remove the requirement to run the external attacher for non-attachable volumes. That is alpha in 1.13, so if you run without alpha features, you still need to run the attacher.
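Going back to the topology support for a moment: with it, a storage class can delay provisioning until the pod is scheduled and restrict where volumes may be created. A rough, illustrative sketch (the topology key and zone are driver-specific examples, and this was alpha for CSI at the time of the talk):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-ebs-topology                      # hypothetical name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer       # provision only once the pod is scheduled
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.ebs.csi.aws.com/zone    # driver-specific topology key (illustrative)
        values:
          - us-east-1a
```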
On the roadmap, we have a lot of cleanup and stabilization to do. We want to implement resize and block volumes; that should be pretty easy. What's harder: we want to do inline volumes in pods. So far, CSI has been consumable only through persistent volumes and persistent volume claims. But for some kinds of volumes, for example Secrets, ConfigMaps and the Downward API in Kubernetes, you don't create a persistent volume for a Secret, right? So we want to have something like that in CSI, where you reference a CSI volume directly in the pod specification, and you can implement something like Secrets that downloads the actual credentials and secret content from, for example, HashiCorp Vault or somewhere else, rather than from a Kubernetes API object. Or you can inject, I don't know, Kerberos tickets or whatever into your pods. This is quite difficult, and, by the way, volunteers are needed. We also want to do some performance testing for drivers. Currently we run our Kubernetes end-to-end tests only on Google Cloud and on Amazon, which means we can test only the GCE and AWS CSI drivers. We want to package these end-to-end tests as a binary or something, so you can download it and run our Kubernetes tests in your data center, with your CSI driver and your storage.

And the ultimate goal is to remove the storage code from Kubernetes. We would really like to do that. How do we want to do it? The current plan is to do it seamlessly: you have Kubernetes that's using the internal volume plugins and you're happy with that; you update to a new Kubernetes version, magic happens, and instead of the internal volume plugins you are using CSI. That's the goal. We are very, very, very far from there. There are some obstacles, because we want the upgrade to be seamless, but Kubernetes also has a policy that downgrades must be seamless. So you upgrade to the new version, everything is working on CSI, and then you discover a bug or something that's blocking you, so you downgrade, and the downgrade should also be seamless, from CSI back to something that's internal. That's where we actually have issues; we don't know how to do that yet. Maybe we will have something in alpha in 1.14. I'm quite skeptical, but we will see. Another thing that's blocking it is that the Kubernetes API is stable, and we have quite a strict policy about that. So if I have a persistent volume that uses the Amazon volume source, then this API object, this persistent volume, must keep working also when we move the storage parts from Kubernetes to CSI. So we will need some translation library and this kind of stuff. It will be fun. If you want to contribute, this is pretty challenging; if you are bored with maintenance or something, this is new development, this is fun.

Until we have the automatic migration, if we ever have it, I don't know, you can do it manually. If you have persistent volumes in your cluster, here's the way to do that. Please back up first; maybe you will need it. It should work, and if it does not, please report bugs; there are not so many people who have tried this. Basically, you have a cluster with your persistent volumes. You must stop everything that uses those persistent volumes. Then you delete the persistent volumes and replace them with their CSI counterparts. You can't edit persistent volumes in the API right now, this is prohibited by validation, sorry, so you must delete them and create new ones. And if you do that, it should work.
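To illustrate the last step, the recreated volume might look roughly like this. Everything here is an example (the volume ID, capacity, file system and driver name), and you would reuse the old persistent volume's name so the existing claim, which still references it by name, binds to it again.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-old-name                         # reuse the name of the deleted in-tree PV
spec:
  capacity:
    storage: 500Mi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain     # so a mistake does not delete the real volume
  storageClassName: csi-ebs                 # must match the claim's storage class
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0     # the same volume ID the old in-tree PV pointed to
    fsType: ext4
```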
I'm quickly approaching the end of my presentation. So, CSI: it's a gRPC interface, focused on containerized workloads, but you can use it outside of containers if you want. For users, no changes. For administrators, some headaches; I think you will get used to it, it's just a few pods running somewhere, nothing special. And storage vendors don't need to come to us and ask for a new volume plugin. They can implement their driver at their own pace, on their own release schedule, whatever; we don't care in Kubernetes. Here's a link to the specification. It's open: you can create issues and pull requests and add new features as you want. But if you add new features, please remember there's somebody who will need to implement them on the Kubernetes side, and I don't want to implement crazy features. We have a Slack channel, sig-storage, and we are looking for volunteers; there is also a CSI workgroup on Slack. It's pretty demanding, there is a steep learning curve, but it's fun. And please remember to run the external attacher. That's literally half of our issues: every day somebody asks on Slack, "my CSI doesn't work." Please run the attacher.

Are there any questions? Yes, what are the advantages of migrating to CSI from the normal plugins? So the question is, what are the advantages of CSI. For cluster administrators, there are none. For me as a Kubernetes maintainer: today I need to maintain all of these as internal volume plugins (sorry, these are the CSI drivers, these are the volume plugins), and I have no clue what they do, and I can't test them in Kubernetes. So the benefit for me is that I don't need to maintain these beasts that I don't understand. Also, Portworx and StorageOS, these are startups; you don't know, maybe next year... I don't know. We have one plugin that doesn't do anything, and nobody knows if it's working. And the major benefit is for storage vendors: they can plug their storage into Kubernetes without any code in Kubernetes. We don't accept any new volume plugins at all, so CSI is the only way to plug in your storage.

Any other questions? So the question is whether the development of the in-tree Kubernetes volume plugins stops and we won't be adding new features. And the answer is: sort of. We implemented, for example, block volumes recently, and topology-aware volume scheduling, and we implemented them for the internal volume plugins first, with CSI following. But at some point CSI will start getting the new features first, and the old volume plugins will just be there: supported, we will fix bugs, but we won't add new features. For example, we have snapshots now, and snapshots are supported only in CSI; we don't implement snapshots for the old volume plugins. So please use CSI and don't use the in-tree volume plugins.

Any other questions? Yes, is the driver tied to the CSI version or the Kubernetes version? The question is whether a CSI driver is compatible with a CSI spec version or with a Kubernetes version. The goal is that a driver should implement some version of the CSI spec, either pre-1.0 or 1.0, and it should then, ideally, be compatible with all Kubernetes versions that support that spec version. So far we haven't seen any trouble with that. I can imagine it will break at some point, but drivers should follow the CSI spec and test on Kubernetes. Does that answer your question? Any other questions? If not, please go install CSI drivers. It's fun. It breaks your cluster in very strange ways you are not accustomed to, but yeah, that's the future. Sorry.