Okay, cool. So I think we can start. Welcome, everyone. My name is Ricardo, I'm a software engineer at CERN.

Hello, I'm Robert Vasek, I'm a student of informatics at the University of Slovakia, in my first year of the master's degree, but for the past eight months I had the amazing opportunity of doing my internship at CERN with Ricardo as my supervisor. I was working on the CSI CephFS and Manila integration into Kubernetes.

Okay, cool. So today we'll talk about the work we've been doing. These are the topics we'll cover, starting with a quick introduction to CERN; I'm sure you've seen other presentations already during this summit or other summits. CERN is located close to Geneva, across the Swiss-French border. It was founded in 1954, it has 22 member states, and its main mission is fundamental science. We try to answer big questions like: what is 96% of the universe made of, what is dark matter, what is dark energy? What was the state of matter just after the Big Bang? And why isn't there any antimatter in the universe, or rather, why don't we see it, since it's supposed to exist in the same amount as matter? To try to answer these questions we build very large machines. The big one we have right now is the Large Hadron Collider. This is a picture of the accelerator: a tunnel 27 kilometers in circumference, 100 meters underground. Here we accelerate protons in opposite directions, close to the speed of light; we increase their energy and then make them collide at specific points, where we have built massive particle physics experiments. This is a picture of the CMS detector; you can see the people there to get an idea of the size. The cavern is, I believe, 40 by 40 meters, and it is completely filled by the detector. It's also 100 meters underground. The result of all of this is that these collisions generate a lot of data: tens of petabytes per year that we need to store and analyze. For this we have a large private cloud based on OpenStack. You can see at the top that we currently have just over 300,000 cores, 36,000 VMs and around 10,000 hypervisors, and then you can see the Magnum clusters. These are container clusters, mostly Kubernetes; we also support Swarm and Mesos, but they are mostly Kubernetes. Every time we take a new screenshot these numbers have increased, which is a sign that containers are getting more and more popular at CERN. Right now we have more than 400 clusters.

I would like to open this presentation by having a look at the title: "Dynamic storage provisioning of Manila/CephFS shares on Kubernetes". That's quite a mouthful, but what's really important here, for my part of the presentation, is that it sounds like I should know what I'm talking about, right? Anyway, it does serve a purpose: it basically says that users of Kubernetes should now be able to create and use Manila shares from their OpenStack cloud, from within Kubernetes. And while this title covers all the technical aspects of this presentation, this one really says how the development actually went down. Because, to be frank with you, when I finished the implementation of CSI CephFS and the Manila provisioner and we conducted our first benchmarks, the numbers we got were nowhere near our expectations, as in pretty bad. But we eventually managed to do a pretty good job with this.
So let's talk about the Container Storage Interface. This is the interface I was using during the development of CSI CephFS, so let's see what it actually is and why it exists. Imagine this scenario: we've created a storage system and we would like to get it to as many customers as possible. In order to do that, we need to integrate the storage system into the existing infrastructures of our users so that people can actually use it. Let's say our customers are interested in container technologies, so we'll focus on those, and we want to target container orchestrators one, two and three. That means we'll probably need three separate teams developing storage drivers for those three orchestrators, so that our storage can talk to them. Let's say that after some time we are done, our customers are happy, and now we are left with three separate storage drivers which essentially do the same thing: they map the storage system to the container orchestrator. That's really not what we are looking for, because we have a few problems. This lack of standardization between orchestrators pushes our development and maintenance costs up. Another problem is that those drivers are usually part of the orchestrator's code base, which means that adding features and fixing bugs is particularly hard, simply because the drivers are tightly coupled with the release cycles of those orchestrators. And due to this tight coupling, bugs in those drivers can propagate through other components of the orchestrator and crash different parts of it. So it would be nice if we had a common interface between all of those orchestrators; that way we would write only a single driver and use it everywhere.

I am of course talking about the Container Storage Interface, CSI, which is the industry standard for cluster-wide storage plugins. It's being developed as a collaboration of communities, including Kubernetes, Mesos, Docker and Cloud Foundry. It defines an interface between the orchestrator and the plugin, and it's not just a simple interface: it's a whole protocol defining the behavior of both the orchestrator and the plugin. The end goal is to have only a single storage plugin and use it everywhere, in any CSI-enabled container orchestrator. The first alpha version of CSI was released in December 2017, with an already working implementation in Kubernetes 1.9. Of course, things have changed significantly since then, and for people like me who were already using this interface and writing drivers it was sometimes a bit difficult to follow, but things are considerably more stable now. Actually, just today the second release candidate of the stable 1.0 version was released, and the stable release is scheduled for the end of this month.

Now let's have a look at the actual structure of CSI. The specification defines three services that deal with storage in container orchestrators. The first is the identity service, which allows the orchestrator to query for the plugin's capabilities, health probes and other metadata. Then we have the sort of control plane of the plugin, the controller service, which creates, deletes and lists volumes and their snapshots. Lastly, we have the node service, which stages and publishes volumes on a node, making them available to workloads on that particular node. Together, these services form a CSI plugin.
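To make this concrete, here is a minimal sketch of what a CSI plugin skeleton can look like in Go, using the generated bindings from the CSI spec repository. The driverServer type, the socket path and the embedded-interface shortcut are our illustration, not any particular driver's code; a real plugin implements the RPC methods instead of embedding the interfaces.

```go
package main

import (
	"net"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

// driverServer stands in for a real plugin. Embedding the three service
// interfaces makes the sketch compile; a real driver defines the methods
// (GetPluginInfo, CreateVolume, NodeStageVolume, ...) itself.
type driverServer struct {
	csi.IdentityServer   // GetPluginInfo, GetPluginCapabilities, Probe
	csi.ControllerServer // CreateVolume, DeleteVolume, ControllerPublishVolume, ...
	csi.NodeServer       // NodeStageVolume, NodePublishVolume, ...
}

func main() {
	// The orchestrator is the gRPC client; the plugin serves RPCs over a
	// plain UNIX domain socket (the path here is just an example).
	lis, err := net.Listen("unix", "/var/lib/kubelet/plugins/example/csi.sock")
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	d := &driverServer{}
	csi.RegisterIdentityServer(srv, d)
	csi.RegisterControllerServer(srv, d) // optional: node-only deployments omit this
	csi.RegisterNodeServer(srv, d)
	_ = srv.Serve(lis)
}
```

This also hints at the split deployment mentioned next: the same binary can register only the identity and node services on worker nodes, and only the identity and controller services in the control-plane deployment.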
Now, this CSI plugin can be compiled into a single executable file, or you can actually split it into two separate plugins which are deployed at different places in the cluster. Regardless of the structure, orchestrators talk to the CSI plugin using the gRPC framework over a simple UNIX domain socket, with the orchestrator being the client issuing RPCs on the CSI plugin, and the plugin coming up with some sort of response.

A really quick look at the RPCs themselves. For the controller service we have CreateVolume and DeleteVolume, as I mentioned earlier. Then ControllerPublishVolume attaches a volume to a certain node; this is required for storage systems that need to know in advance which node is going to consume a particular volume. For the node service we have NodeStageVolume, which prepares the volume before it's consumed by a workload, and NodePublishVolume, which actually exposes the volume to the workload, ready to use. And lastly, the identity service provides information about the plugin. You can see some of those items grayed out: they are optional, simply because not all storage systems require, for example, volume attachments; that's why ControllerPublishVolume is optional. The same goes for NodeStageVolume, and actually the whole controller service is marked as optional, because we need to accommodate plugins that implement only the node service.

Okay, so as for the way this is implemented in Kubernetes, there is quite long and really good documentation on the Kubernetes website, but the gist of it is that there are four main components that deal with CSI in Kubernetes. The first one is the kubelet itself, which implements the in-tree volume plugin for CSI and acts as a client for the node service: it calls NodeStageVolume and NodePublishVolume when necessary. Then we have the sidecar containers, which accommodate the rest of the RPCs. If we take the right side of the slide to be our Kubernetes cluster, and those little logos to be our nodes, let's say we want to deploy the plugin in such a way that every node has access to our storage. What do we do? We simply deploy it as a DaemonSet and we're done. But that's not really enough, because even though each node is now running an instance of the plugin, Kubernetes has no knowledge that these are actually CSI plugins and that they can be used. For that we need the driver registrar, which uses the kubelet plugin discovery mechanism to register the plugin with the kubelet. Then we have the external provisioner and the external attacher. Those work in a very similar manner, hooking into the Kubernetes event system and responding to events. The external provisioner is interested in PersistentVolumeClaims: when a user creates or deletes a PVC, it responds by calling the appropriate RPCs. The same goes for the attacher, where ControllerPublishVolume or ControllerUnpublishVolume is called when a volume is about to be attached to or detached from a node.

Okay, so finally on to something I've been working on myself, so that I don't just talk about other people's work. The CSI plugin for the Ceph file system, CSI CephFS, provides an interface between a CSI-enabled container orchestrator and a Ceph cluster. Using this plugin, users can provision and use CephFS volumes from the container orchestrator. For mounting, it supports both the kernel CephFS client and the FUSE client, and users can pick between those two options on a per-volume basis.
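As an illustration of that per-volume choice, here is a condensed sketch of what the decision could look like inside the plugin's node service. The nodeServer type, the "mounter" volume attribute key and the bare mount commands are simplified assumptions for the sketch (the real driver also handles credentials, idempotency and error cases); field names follow the 1.0 Go bindings, where what the 0.3 spec called volume_attributes became volume_context.

```go
package cephfsdriver

import (
	"context"
	"os/exec"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type nodeServer struct{}

// NodeStageVolume mounts the CephFS volume onto the staging path, picking
// the mount tool the user asked for on a per-volume basis.
func (ns *nodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	target := req.GetStagingTargetPath()
	attrs := req.GetVolumeContext() // "volume_attributes" in the 0.3 spec
	monitors := attrs["monitors"]   // Ceph monitor addresses, set at provision time

	var cmd *exec.Cmd
	switch attrs["mounter"] {
	case "fuse":
		// ceph-fuse: userspace client, decoupled from the node's kernel,
		// typically gets new CephFS features first.
		cmd = exec.Command("ceph-fuse", "-m", monitors, target)
	default:
		// Kernel client: generally better performance, but depends on the
		// kernel version the node happens to run.
		cmd = exec.Command("mount", "-t", "ceph", monitors+":/", target)
	}
	if out, err := cmd.CombinedOutput(); err != nil {
		return nil, status.Errorf(codes.Internal, "mount failed: %v: %s", err, out)
	}
	return &csi.NodeStageVolumeResponse{}, nil
}
```

NodePublishVolume would then typically bind-mount the staged directory into the pod's target path, which is why staging and publishing are separate steps.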
Offering both mount options is useful because the two clients don't provide the same features or performance, so one might be more suitable than the other depending on the case. Comparing this CSI plugin to the in-tree CephFS volume plugin that Kubernetes currently contains: the general idea behind the current in-tree volume plugins is that eventually those, or at least the majority of them, should be migrated to CSI plugins, and this is one of them. Since this CSI plugin is completely decoupled from Kubernetes, we have the freedom to do things like choosing between mount tools, and of course there is still some functionality left to do, like volume expansion and taking snapshots.

Now, moving on to Manila and Kubernetes. The Manila provisioner brings OpenStack Manila and Kubernetes closer together, in the sense that users can now create Manila shares from within a Kubernetes cluster. What it essentially does is map Manila shares to Kubernetes PersistentVolume objects, which can then be used just like any other volume type. Currently only CephFS shares are supported, simply because that's what we were interested in during development, but it can easily be extended to other share types as well. For authentication it supports both user credentials and trustees, and there is a really cool use case for this: Magnum.

Okay, so I can pick this up. As I mentioned, at CERN we have a lot of container clusters, and they are all deployed with OpenStack Magnum. The fact that the plugin supports both user credentials and trustees is important for us, because the clusters are already deployed with a trustee user inside. It means that a user doesn't have to provide any additional credentials: whatever credentials the cluster was deployed with can be used to create shares, to talk to Keystone, for all sorts of OpenStack operations. That simplifies the setup a lot, so for us the fact that these CSI drivers exist is really useful, and it minimizes the development effort on our side.
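For a feel of what the provisioner has to do under the hood, here is a minimal sketch using the gophercloud OpenStack SDK: create a share sized from the PVC, then hand its export location back to Kubernetes. The names and the omitted error handling are ours; the real provisioner also waits for the share to become available, creates an access rule and builds the PersistentVolume object.

```go
package main

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack"
	"github.com/gophercloud/gophercloud/openstack/sharedfilesystems/v2/shares"
)

func main() {
	// Credentials come from the environment here; in a Magnum cluster this is
	// exactly where the pre-installed trustee credentials would slot in.
	opts, _ := openstack.AuthOptionsFromEnv() // error handling elided
	provider, _ := openstack.AuthenticatedClient(opts)
	client, _ := openstack.NewSharedFileSystemV2(provider, gophercloud.EndpointOpts{})

	// One PersistentVolumeClaim maps to one Manila share of the requested size.
	share, err := shares.Create(client, shares.CreateOpts{
		Name:       "pvc-example",
		ShareProto: "CEPHFS",
		Size:       10, // GiB, taken from the PVC's storage request
	}).Extract()
	if err != nil {
		panic(err)
	}
	fmt.Println("created share", share.ID)
	// Next steps in a real provisioner: poll until the share is available,
	// grant access (e.g. a cephx access rule), fetch the export locations
	// and wrap them in a Kubernetes PersistentVolume.
}
```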
Now, this was an overview of the different components, so we'll quickly jump to the benchmarks. We started the benchmarks with a couple of goals. We have many different use cases for containers at CERN, and we have to make sure they actually work. We also needed to make sure that when you define a PersistentVolumeClaim in Kubernetes, it works the same on our OpenStack cloud as it would on, say, GKE; that's where the Manila provisioner comes in. And one of the goals was to check that the CSI driver behaves well even when things go wrong (the point was not really to test CephFS itself), which is why we tested with a smaller Ceph cluster and put a lot of load on it, so that we would create some problems underneath and see how the plugin reacts.

On the client side we used Kubernetes 1.12, and the CSI CephFS version was 0.3.1. We used two Ceph clusters. One is called White: it has only three nodes with spinning drives, and it's running Luminous. The other one is slightly bigger; it's the cluster we use for HPC, with 300 nodes and around 300 OSDs, all SSDs, also on Luminous, and it's hyperconverged, meaning compute and storage share the same machines.

Maybe I should explain the methodology a bit. For the benchmarks we had a Kubernetes cluster of 100 nodes, and for each of the tests we ran, we provisioned a certain number of CephFS shares using the Manila provisioner, then created a Deployment with a certain number of replicas, sized so that each pod would be scheduled onto a separate node; that's how we controlled the exact number of connections to the CephFS cluster. Each of those pods would then mount all of the shares we had provisioned earlier, and then we would take our measurements. As for the actual tests, the first batch were idle tests, where we basically did nothing with the shares; we just wanted to see how well the setup scales. For the busy tests we made each pod unpack a large archive, in our case the Linux kernel source tree, as a way of stress testing the CephFS metadata servers, and we stressed them pretty well. As for the parameters, for the idle tests we provisioned 100 shares across 100 replicas, making up 10,000 idle clients; for the busy tests we had 10 shares across 100 replicas, making 1,000 clients simultaneously unpacking the Linux kernel.

The first go at it was to start low, so we didn't go straight to 10,000 idle clients: we started with 10 shares and 100 replicas, which means 1,000 idle clients. The outcome was not immediately very good; we saw a bunch of errors that Robert was happy to debug.

So this one is basically a complaint from the CSI volume plugin in the kubelet, in WaitForAttach: the node authorizer was not giving certain nodes access to VolumeAttachment objects. What we learned from this is that the provisioning of the shares worked just fine (it was only 10 shares in this case), and some pods were up and running with those shares mounted, but the ones that reported this error would not recover; no amount of waiting would make them run.

The next attempt was to cheat: instead of immediately fixing the problem, we tried to hide it with a loop that would kill the pods that got into this state, and try to get a bit higher in the number of clients. We knew this wouldn't be a real solution, but we wanted some proper numbers. The result doesn't look that nice, but eventually we somehow managed to get to 650 concurrent clients. It was really slow and very ugly, so that wasn't good either.

For our third attempt: by that time Kubernetes 1.12 was out, along with driver registrar 0.4, and those brought some pretty interesting features, like the kubelet plugin registration mechanism, through which the driver registrar can tell the kubelet to use or not use certain features. One of those was the CSI skip-attach feature. The way this worked prior to Kubernetes 1.12: when the kubelet sees that a pod is requesting a CSI volume, it creates a VolumeAttachment object for that particular volume and waits for the volume to be marked as attached. So the kubelet on node number 2 is waiting; meanwhile, the external attacher picks up this VolumeAttachment object, does something with it, and eventually marks the volume as attached, so that the kubelet on node number 2 can continue. This was done for all CSI volumes, simply because volume attachments are in the domain of the controller plugin, and as you can see, node number 2 runs only the node plugin, so the kubelet cannot make any assumptions here; it leaves the decision of whether or not to actually attach the volume to the external attacher. Now, with the CSI skip-attach feature, the driver registrar can tell the kubelet that this plugin doesn't need volume attachments at all, so the whole attachment process can be skipped, and that solved the issue.
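From the plugin's point of view, "doesn't need attach" simply means the controller service never advertises the PUBLISH_UNPUBLISH_VOLUME capability (in Kubernetes 1.12 the skip-attach mechanism itself was an alpha feature, surfaced through a CSIDriver object whose attach-required flag is set to false). Here is a sketch, against the CSI Go bindings, of a controller like the CephFS one that only does volume create and delete; the controllerServer type is our placeholder:

```go
package cephfsdriver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

type controllerServer struct{}

// ControllerGetCapabilities reports what the controller service supports.
// Omitting PUBLISH_UNPUBLISH_VOLUME tells the orchestrator that there is
// nothing for an attach/detach step to do for this driver.
func (cs *controllerServer) ControllerGetCapabilities(ctx context.Context, req *csi.ControllerGetCapabilitiesRequest) (*csi.ControllerGetCapabilitiesResponse, error) {
	return &csi.ControllerGetCapabilitiesResponse{
		Capabilities: []*csi.ControllerServiceCapability{{
			Type: &csi.ControllerServiceCapability_Rpc{
				Rpc: &csi.ControllerServiceCapability_RPC{
					// Create/delete only: no attach capability advertised.
					Type: csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME,
				},
			},
		}},
	}, nil
}
```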
Right, so once we finally thought we had fixed it (and it turned out we had), we went for the maximum number of clients we had planned from the start. This means we had 100 Manila shares being created, and all those shares were mounted on every single node in the cluster; 100 times 100 gives 10,000 FUSE mounts, which corresponds to 10,000 clients on the CephFS side. We did this gradually: we didn't want to go to 10,000 in one go, but in a controlled way, deploying 5 replicas at a time, which is 5 times 100 mounts, so we were adding 500 clients at a time, and in half an hour we had 10,000.

And this is how it looks. These are the sessions on the CephFS MDS, the metadata server, and you can see the count gradually going up; everything looks good. In addition, the cluster has three active MDSs, and you can see that they are balanced really well, which is kind of perfect in this plot. We had a lot of other plots; we have plenty of information from the Ceph side if you are interested. Here you can see the namespace request rates. It's a bit weird that idle clients are creating so much load on the metadata servers; this is something we want to investigate further. The same goes for the pool I/O: it's not completely clear why idle clients would do this much I/O on the metadata pool. And here we see the number of requests, which is also quite high. The CPU load on the MDS servers was quite high as well; the cluster was clearly under-provisioned for this kind of test, but that was the goal from the start, so this is interesting for us.

Among the further things to investigate, we did see a couple of interesting logs. Here we see that some clients are evicted from the cluster from time to time. We don't know exactly why; it might just be a question of tuning the timeouts for the FUSE clients. The CSI driver was properly reconnecting, so from our side we wouldn't see the issue, but on the Ceph side we do see it reported. Another thing we saw is that some kubelets, some nodes in the cluster, would flip to NotReady, meaning you couldn't schedule new workloads on them. This is something that isn't really understood by us right now, so we'll keep investigating; they would come back to Ready in a matter of seconds, but this shouldn't happen.

Now for the busy benchmark. The idle tests were really just clients doing a sleep and keeping the connection open; for the busy benchmark we were extracting the Linux kernel, and the goal was really to break something. That was the main goal of this exercise, and we did. Here is a plot of the MDS daemons again, on the Ceph side, and there's something really interesting: this cluster has three active MDSs, but it also has a standby MDS, and you can see that one of the MDSs got into trouble and the standby picked it up immediately. So from our side we actually didn't see any issue; Ceph managed to recover, as it often does. But we did crash something at some point, which is always an achievement, especially when our main Ceph expert, whom many of you might know, sends us messages like this; that usually makes our day. This was clearly the next achievement. Now, a bonus benchmark that we did was deleting the 10,000 clients.
So we had started 10,000 clients, and here we are deleting them. Everything went smoothly on the Kubernetes side: we couldn't see any issue, all the shares disappeared, all the persistent volumes disappeared from Kubernetes. But on the Manila side we did see some of the shares getting stuck in the deleting state. We went in and triggered a force delete, and in some cases even that wasn't enough, so we kicked the manila-share daemon and that fixed the issue. It's something we will try to reproduce again so we can get it fixed. Here is the plot of the sessions again, disappearing quite quickly; here we were not nice, we just killed them all in one go.

So, as a conclusion, what we can take from this: first of all, there is now a standard storage interface for container orchestrators, which is a really powerful tool, since you can have a single storage driver for any number of container orchestrators as long as they support the protocol. This definitely made my life easier while developing CSI CephFS. It works very nicely in Kubernetes; there are some issues, as we saw earlier, and those are being investigated, but that's completely understandable since this is a really new piece of software. Other orchestrators will follow with their implementations very soon, because, as I mentioned, the stable 1.0 is coming up shortly, and Mesos already has an implementation. As for the benchmarks, what we saw is that both the Manila provisioner and CSI CephFS can handle large concurrency pretty well. We saw some hiccups, but those will be investigated as well. It's basically in a state where it's really usable: it's actually deployed, and users can use it on a daily basis. As for the next steps, there is a plan to add volume expansion and snapshots to CSI CephFS, and to turn the Manila provisioner into a CSI plugin. The Manila provisioner is a Kubernetes-specific piece of software, so making it a CSI plugin would enable users to access Manila shares from any orchestrator that supports CSI. So yeah, I think that's it. We have some time for questions, so feel free to come up.

With the tests creating the shares, did you try to ramp up the velocity, to create them more frequently, or did you stop at the benchmark you presented here?

Right, so you mean the rate of creation of new clients? We did start doing that, creating more quickly, but we didn't want to push it on this test cluster, because it's really undersized for the load we were putting on it. We will do it on the other cluster that we have; we don't have the numbers here, but we will publish them as soon as we have them.

Okay, thank you. And just another quick question: what was the resolution of the graphs? I think it was 20 minutes, or did I get that wrong?

This one is five minutes; we ramped up the clients over about 30 minutes, something like that. To be honest, the promise is that the MDSs on the Ceph side will scale horizontally, so if we add more clients we can add more MDSs, or at least that's what our Ceph admins promise, so we'll be testing that.

Okay, thank you. Cool, if there are no more questions... Can you reproduce the issue where you do the deletes and make manila-share hang? Say again? Can you reproduce the issue where you make the manila-share service hang by doing the deletes, and give us the logs? Yeah, definitely. We actually managed to reproduce the issue on both clusters, so we have collected everything to send it upstream.
Yeah. Okay, thank you very much. Thank you very much.