Awesome. OK, hello, everyone. Thanks for coming along. My name is Arne Wiebalck, and I'm from the CERN cloud team. Today I'm going to summarize what we have done over the past weeks and months to prepare Manila and CephFS for production at CERN. The CephFS service is not run by the OpenStack team itself, so I worked closely with my colleague Dan to put together the material for this presentation.

Before we talk about Manila and CephFS, I would like to briefly introduce CERN for those of you who would like to know a little more about what it is. CERN is the European Organization for Nuclear Research. It was founded in 1954 and has, at the moment, 22 member states. The acronym CERN, however, is very often also used for the laboratory itself: the world's largest particle physics laboratory, located at the border between Switzerland and France. There are about 2,300 people working there, serving 12,500 users, and we have a budget of roughly 1 billion Swiss francs per year to follow our primary mission, which is to find answers to some of the fundamental questions of the universe: What is matter composed of? What happened right after the Big Bang? Why is there an asymmetry between matter and antimatter? And so on. To do this we have a tool, the Large Hadron Collider: a particle collider built in a tunnel roughly 100 meters underground. You can see a schematic of it in the picture. Here you are looking from the French side, from the French mountains, onto Geneva and Lake Geneva; the dotted line is the border between the two countries, and you can see roughly where the accelerator is. The accelerator has basically two beam pipes in which particles circulate before they collide, and its circumference is roughly 27 kilometers. There is a lot one could say about this machine and what it can do. One number that I find particularly impressive is that before the particles are collided, they are accelerated until they travel the 27 kilometers 11,000 times per second — so they are very close to the speed of light. It's very high energy. They are collided, and at the interaction points detectors record what's happening; the data is then analyzed to understand the corresponding physics. I don't have time to say much more about this; if you're interested, I invite you to visit our home page, home.cern, where you'll find a lot more information.

At CERN we have run an OpenStack cloud since July 2013. We have, of course, upgraded the cloud several times; we're currently on Newton for most of the components. The cloud spans two data centers: one located at the main site in Geneva, which you just saw, and a second data center in Hungary, 23 milliseconds away. We have only one region, in order to provide one API, so for the users it's mostly transparent. Our cloud has, at the moment, 220,000 cores. We will add another 80,000 or so in the coming weeks, so it will get a bit bigger; we hope to cross 300,000 cores. This runs on the 7,000 hypervisors that we manage, and the additional cores will come with an additional 2,000 hypervisors. At the moment we have 27,000 VMs running in that cloud. In order to get to this scale, we are heavy users of cells.
With cells v1, we have more than 50 cells into which the various hypervisors are organized. We also use this to separate the different types of hardware that we have and the different use cases — service versus compute use cases. There are different cells for the different power feeds, for the two locations, and so on. So this is, very briefly, our cloud in one slide.

Now, as this talk is about storage: CERN does a lot of things in storage, and here too I cannot cover all the different storage systems that we have. We have basically everything from tape robots over disks to SSDs to help our users achieve whatever they are trying to do. The flagship storage system we have at the moment is an in-house development called EOS. It's mostly used for data analysis and currently has a capacity of 120 petabytes on 44,000 spinning drives. What we're going to talk about, though, is this area: we have Ceph as the main backend for Cinder, and in addition we have something we call the NFS Filer service — for users that cannot use one of the other storage systems and require NFS access to their data. We're going to focus on these two.

Now, the NFS Filer service is, in the end, also based on OpenStack. It consists of NFS appliances: multiple VMs that each attach a volume, use ZFS as the file system, and export it to users. The second data center is used for ZFS replication for disaster recovery, and we use local SSDs as accelerators for the L2ARC and the ZFS intent log (a rough sketch of such an appliance follows below). As I said, this is for users that need strong consistency and POSIX access to a storage system, so we have users on there like Puppet or GitLab and various other applications. This service, however, has some limitations. If you have a single NFS server, as we do for these use cases, there are limits that you hit, like the metadata operations. This, for instance, is a plot of the metadata operations: we see it's around 20,000, and then there's a bump, we believe from when we upgraded from Puppet 3 to Puppet 4 — Puppet is one of the main users of this. Another limitation is availability: if you have a single VM with a single volume mounted and something goes wrong, you may have issues with access to the storage. And what the users really ask for, often, is some kind of shared block device that is available on multiple instances.

In addition, we have emerging use cases, which are HPC use cases; I have an extra slide for this. This is not our core data analysis data path or workflow, but this is where we do beam simulations and accelerator physics, to understand how the machine works and how it needs to be tuned, plus simulations in QCD, or quantum chromodynamics. The theorists work with MPI applications to do their computations. As I said, that's slightly different from the usual HTC model — the high-throughput computing that we do, where we shuffle data through a batch system and on to analysis. This is really high-performance computing, where we need a low-latency cluster; we have dedicated clusters for this. We need systems that can support jobs that run for four or eight weeks, and these applications need access to shared storage, for instance to store temporary state. And this is not what NFS can support in our environment.
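[Editorial sketch] To make the NFS Filer appliance described above concrete, here is a minimal sketch of the kind of commands such a VM-based appliance would run. All names (pool, devices, network range, remote host) are hypothetical; this is a reconstruction of the general pattern, not CERN's actual tooling.

```python
import subprocess

def run(cmd):
    """Run a command and fail loudly -- helper for this sketch."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create a ZFS pool on the attached Cinder volume (vdb), with local-SSD
# partitions as the ZFS intent log (SLOG) and as the L2ARC read cache.
run(["zpool", "create", "filer", "/dev/vdb",
     "log", "/dev/vdc1",      # SSD slice accelerating synchronous writes (ZIL)
     "cache", "/dev/vdc2"])   # SSD slice used as L2ARC

# Create a dataset and export it over NFS to the client network.
run(["zfs", "create", "filer/projects"])
run(["zfs", "set", "sharenfs=rw=@10.0.0.0/8", "filer/projects"])

# Disaster recovery: snapshot and replicate to the appliance in the
# second data center via zfs send/receive.
run(["zfs", "snapshot", "filer/projects@2017-05-08"])
subprocess.run(
    "zfs send filer/projects@2017-05-08 | ssh dr-host zfs recv -F filer/projects",
    shell=True, check=True)
```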
So in the middle of last year, we set up a dedicated CephFS cluster for this. It's relatively small, 150 terabytes, and it sees relatively low activity, but it was the first attempt to actually get CephFS into CERN and to see whether CephFS could address the need for a distributed shared file system. Now, the question we had is: can we converge these two use cases — the HPC use case and the NFS use case — onto CephFS and basically get rid of the NFS Filer service, to consolidate things a little? And if yes, how do we make this available to the users? How do we manage users that want a share here or a share there — it's not feasible to do this by email or something. OK, so this is why we're looking into CephFS and Manila. I will start with the CephFS backend.

Very briefly, for those of you who don't know what CephFS is: CephFS is the POSIX-compliant shared file system that sits on top of the Ceph RADOS layer. So it's the same foundation as RBD, which we have already used at CERN for multiple years; we know that it works and is very reliable. There are userland and kernel clients available for CephFS, where the userland client usually gets features first and they land in the kernel client later. The current production release is Jewel. In Jewel, CephFS was tagged as production-ready, and the main addition at the time was a file system check that allows an operator of the system to detect whether there is something inconsistent in the file system. It was almost there before, but the focus had been on block and object, to get those foundations rock solid; now the focus is also on CephFS.

The main addition that comes when you go to CephFS is the CephFS metadata server. As I said, we already know that the foundation works, but the MDS now becomes the crucial component for building a fast and scalable file system. The metadata server basically does two things: it creates and manages the inodes, which are persisted in the underlying object store but cached in memory, and it tracks the clients' inode capabilities — which client is using which inode. So a larger cache on your MDS can help with metadata speed, and more RAM on the MDS can avoid having to wait for metadata reads from RADOS. However, with a single MDS this of course becomes a bottleneck at some point, so we need multiple MDSs for scaling. It's maybe interesting to know that the MDS itself keeps nothing on disk, so having SSDs on the MDS doesn't really help with anything — but a flash-based RADOS pool may help with metadata-intensive workloads.

The testing we have done with the MDS consists of very simple, basic checks, like POSIX compliance: we used the Tuxera POSIX test suite for this, and it came out OK. My colleague Dan has written a tool, very similar to ping, that checks the consistency delays you see between two clients — it's like ping with a file system in between, if you like — and that seems OK as well. We've seen some reproducible slowdowns when multiple clients write to the same area, and we're in touch with upstream to understand what that is. Now, in order to understand whether this can actually do what we want, we tried to mimic what the Puppet master does — the Puppet master being, if you remember, one of the main users we have on NFS. We basically took a copy of the current files that the Puppet master has and ran a massive find over them from one or multiple clients; the sketch below shows roughly what such a metadata-stress test looks like.
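[Editorial sketch] A minimal illustration of that kind of test: walk a directory tree from several workers and count stat() calls per second. This is an illustration of the idea, not the actual script used at CERN; the mount point is hypothetical.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

MOUNT = "/cephfs/puppet-copy"   # hypothetical CephFS mount holding the copied tree
NUM_WORKERS = 8

def walk_and_stat(root):
    """Recursively stat every entry under root, like a massive 'find'."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            os.stat(os.path.join(dirpath, name))
            count += 1
    return count

start = time.time()
# One worker per top-level directory, mimicking multiple parallel clients.
subtrees = [e.path for e in os.scandir(MOUNT) if e.is_dir()]
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    total = sum(pool.map(walk_and_stat, subtrees))
elapsed = time.time() - start
print(f"{total} stats in {elapsed:.1f}s -> {total / elapsed:.0f} stats/s")
```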
What we saw is that we hit a limit at roughly 20,000 stats per second. That's enough for what we need, if you remember the graph I showed before, but it's just OK — so here, too, we'll need to scale out a little more. The other thing we tried is failover: if you have a second MDS and you fail over to it, and also with the multiple-active-MDS setup in the Luminous release, this all seems to be OK. So from our testing it's all fine. The only thing we found is that data being accessed may move around between the MDSs when you would expect it to stay put, but there, too, we are in touch with upstream to understand what's happening.

We have found a couple of issues with CephFS. I won't go through all of them; you can see that most of them have already been addressed. Two things we would probably still need are in the areas of quotas and throttling. At the moment there is no quota enforcement in the kernel client, and in the FUSE client quotas are basically advisory — you need clients that behave, or they can fill up your cluster or bypass the quotas that you have. The other thing is some throttling or QoS that would allow us to protect the cluster in case a user launches a batch job with 10,000 clients that start to hammer CephFS so that no one else can get access to the system. To illustrate how CephFS quotas worked at the time, there's a sketch below.
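[Editorial sketch] CephFS quotas in this era were set as extended attributes on a directory and enforced only by cooperating (FUSE) clients — exactly the limitation described above. A minimal sketch, with a hypothetical path:

```python
import os

SHARE_DIR = "/cephfs/volumes/hpc-share"   # hypothetical directory on a CephFS mount

# Set a 100 GB quota on the directory. This is just an xattr: the cluster
# stores it, but enforcement is up to the client. ceph-fuse of that era
# honored it (advisory); the kernel client did not enforce it at all.
os.setxattr(SHARE_DIR, "ceph.quota.max_bytes", str(100 * 1024**3).encode())

# Optionally cap the number of files as well.
os.setxattr(SHARE_DIR, "ceph.quota.max_files", b"1000000")

# Read the quota back.
print(os.getxattr(SHARE_DIR, "ceph.quota.max_bytes").decode())
```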
So these are the two things we see we would need, but in principle it passed all our testing, and it works. So, where we are: CephFS is awesome. POSIX compliance looks very good in our testing. We have had, as I said, the HPC use case on there since the middle of last year, and we haven't had any issues with it — no difficult issues, at least. Quotas and QoS I mentioned. With the single MDS we have at the moment, it's good enough for our use case, but we will need multi-MDS — and if you were at Sage's talk earlier, this will come with the Luminous release, and we are already testing it. What we haven't looked at yet is backup — how do we actually back this up? — and NFS re-exports for our legacy clients, or for clients that need quotas. As I said, Luminous testing has started. So CephFS is working very well for us. The second part is: how do we integrate this into our OpenStack deployment, and how do we hand file shares out to the users? This is why we started looking into Manila.

Now, before I talk about what we did with Manila: you may have seen in the news, at the end of April last year, that the LHC was suddenly shut down. This is the front page of the website of Le Monde, which says that a fouine caused the LHC to shut down. A fouine is the French term for a weasel — it's this animal. What actually happened is that this weasel got into one of the electrical power stations and caused an emergency shutdown of the largest machine mankind has ever built. Now, you can imagine that when I then started with Manila — a project that has this animal as its mascot — you can imagine what kind of jokes I got, right? Whether I was trying to shut down the whole of IT. You will see how close they came to the truth in a second.

So, a Manila overview for those of you who don't really know what Manila is: Manila is the file share project in OpenStack. It allows you to provision file shares to virtual machines — it's like Cinder for file shares, or Cinder for NFS; it's a very similar concept. Clients, or tenants, can request shares to be created; these are then created by backend drivers, and the client can access such a share from an instance, mount it, and work with it from there. It supports a variety of protocols — for us it was of course relevant that CephFS is supported — and it also supports the notion of share types, so we can have different types of shares, which lets you map things to different backends. That's also something we use heavily in Cinder, for instance. Roughly, the components of the service look like this: you have an API component that receives requests, does the authentication, and handles the requests in general; you have the scheduler, which routes a request to the appropriate share service; and you have the share service at the back, with the driver that actually manages the shares themselves — creates them, deletes them, and so on. In addition, you have a message queue for the communication between the components, and a database where, as usual, the metadata is stored. This picture is not complete — I have omitted the data service, for instance, because we didn't use it.

When we started doing this, it turned out to be pretty convenient if you already have a cloud. What I did is I created three virtual machines and started the API, the scheduler, and the share daemon on all of them. I had a separate RabbitMQ cluster, for which we already have Puppet modules, as for all the other components you saw earlier in the list, so this is all automated; and we have a service that provides you a database. So what did I actually do? Not very much. And we used the existing Ceph cluster as the backend. If you deduct the time I needed to set up Rabbit — which, as I said, is Puppet — and to request the database, we had something working in less than one hour. It was really, really easy to set up. Well, I say "working", because some of you may have noticed that we had three VMs, all of them running a share service, which is not something you can do — and we found out, in particular, because the CephFS driver, when it's launched, evicts all the other clients that use the same authentication identifier. So we were wondering why there was always only one. I was immediately talking to upstream, and I thought they had found a bug — but the bug was actually me. The way it should run is like this: you should have only one share service talking to your backend. Funnily enough — I put this in very small print here, I don't know if you can see it — I also changed our Cinder setup, which was following the same wrong approach: we had multiple volume services talking to the same backend, which could also lead to issues.

Then, once we had this set up, we started testing. The first thing we did was create and delete shares sequentially, very slowly. That all looked good: it takes a couple of seconds to create a share; deletion takes a little longer. One thing I saw, which I haven't followed up on yet, is that there are more authentication calls than I would expect, but it all works. The next thing I did was bulk deletion of shares: if you have many shares and you want to delete all of them, you can either do it sequentially — manila delete A, then B, then C, and so on — or fire the deletions off in parallel; both variants are sketched below.
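[Editorial sketch] A minimal illustration of the two deletion patterns, driving the standard manila CLI from Python. The share names are hypothetical, and the environment is assumed to hold valid OpenStack credentials.

```python
import subprocess

shares = [f"magnum-node-{i}" for i in range(50)]   # hypothetical share names

# Variant 1: sequential -- one CLI call per share, waiting for each to return.
for share in shares:
    subprocess.run(["manila", "delete", share], check=True)

# Variant 2: parallel -- spawn all deletions at once, then wait for them.
# (The CLI also accepts several shares in one call: manila delete A B C D.)
procs = [subprocess.Popen(["manila", "delete", share]) for share in shares]
for p in procs:
    p.wait()
```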
Both work. So how did I come up with this test? The thing is, we have an issue with Cinder where, when you do this, Cinder basically has problems — and this is what users really do: for instance, they have a Magnum cluster with 50 nodes, all of them with a volume attached, and then they delete the Magnum cluster; it goes to Heat, Heat says OK, delete this cluster, and it launches exactly this pattern — and it affects our production service. Manila did fine. When trying to debug this in Cinder, I looked at the code; it looks very similar to me, so I'm not really sure why it doesn't work in Cinder yet.

OK, so after I had done these very simple tests and run them for 24 hours in a loop — creation, deletion, all the time — and it all worked fine, I thought we could get a little more serious. So the next thing I did was start something called the Fouine Hammer — remember, fouine is the French name for the weasel. With the help of my colleagues, I used our Magnum deployment to set up a Kubernetes cluster, and in this Kubernetes cluster I had pods, and these pods would try to talk to Manila — for instance, a simple manila list. That's the Fouine Hammer. And then I tried to scale things up in order to break it; that was the idea. This should just give you an idea of how that looked — I hope you can see it; you don't have to read it. On the left-hand side you see tails of the three Manila API servers that I have. On the right-hand side I now create a Kubernetes deployment — you see kubectl, and there's one pod at the moment — with manila list running in a loop, and you see how the APIs start to work. And there's a nice tool called kubetail on the right-hand side that gathers the output from the pods and shows what it sees. OK, that's all very fine, all nice. I should probably stop that for a second. So — oops — all right, that's the run.

The main point here is that once you have all this set up with Kubernetes, it's very easy to scale — and very dangerous. So what happened: we're now at the point again where you see the output from this one pod in the top-right terminal. It says manila list, there's one share, and it does that as fast as it can. Now, in the bottom-right corner, I open the YAML file that describes the deployment and go from one pod to 10 pods. That takes me two seconds — it's a really sharp knife. And you see that the APIs suddenly get busier, because now there are 10 pods, and in the top-right corner you see, color-coded, the different pods and what they are actually doing. So you can imagine what happened next, right? If you can do one and you can do 10, you can do 10,000, right? Well — you cannot. I will come back to this later. OK, so here's basically what I saw when scaling the number of pods against the number of requests, trying to stress the APIs and see where they break. A rough sketch of how such a hammer deployment is created and scaled is below.
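[Editorial sketch] A minimal reconstruction of the Fouine Hammer pattern: a Deployment whose pods run manila list in a loop, scaled up with one command. The image name is hypothetical — it would need the manila client and credentials baked in — and this reconstructs the pattern shown in the demo, not CERN's actual manifest.

```python
import subprocess

# Deployment: each replica runs 'manila list' in a tight loop.
MANIFEST = """
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fouine-hammer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fouine-hammer
  template:
    metadata:
      labels:
        app: fouine-hammer
    spec:
      containers:
      - name: hammer
        image: example/manila-client:latest   # hypothetical image with CLI + creds
        command: ["sh", "-c", "while true; do manila list; done"]
"""

subprocess.run(["kubectl", "apply", "-f", "-"], input=MANIFEST,
               text=True, check=True)

# Scaling from 1 to 10 (or 1,000) pods is a single command:
subprocess.run(["kubectl", "scale", "deployment/fouine-hammer",
                "--replicas=10"], check=True)
```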
In the small graph you see, zoomed in, the area on the very left: it scales linearly with the number of API servers. I had three nodes with four cores each, one API worker per core, so that's 12. Up to 12 it basically scales linearly, and then it kind of stops, which is what I would expect. So that looks OK. The pods can of course do more complicated things than listing — they can create and delete. So I had each pod creating and deleting all the time, and then you create a lot of pods. And this is how it looked on our dashboard. What you see here is: I had one pod, 10 pods, 100 pods, then I went for lunch and went back down to 10 so that I wouldn't disturb anyone, and then I went to 1,000. You see how the request time is affected depending on how many clients are talking to Manila. You notice it when you have 100 pods hitting the API in parallel, and the request time goes through the roof if you really scale it out. And if you really exaggerate, you see something at the bottom: the number of errors in the logs when it hits limits like the database connections. OK — so it works fine. We don't have 10,000 clients trying to create shares at the same time; from our experience with Cinder, it's more like one every couple of seconds. So it looks very fine. I couldn't break it. I was a little disappointed there, but it just works.

Now, not everything worked: the Fouine Hammer created some trouble in other places. Right away, when I created this Kubernetes cluster — it was my first Kubernetes cluster, OK? — I thought I could just do it with 500 nodes. We have an internal chat system, and I saw that someone posted "the registry is down". And I was thinking, oh, that's unfortunate... wait a second. So yeah, I put something on the chat saying, this is probably me. The colleagues were very kind, actually. So that was right away — I broke it. Then, if you scale further, you hit other issues, like the DNS: the DNS within Kubernetes needed to be scaled. At some point, the amount of monitoring data I created flooded our Elasticsearch system — and there, too, I didn't realize it was me. It was pretty funny: for several days I was complaining to them that things didn't work, because I was trying to create these graphs and see it all in the dashboard, but the dashboard was empty because there was no data, and there was no data because I had broken it. It's a chicken-and-egg problem. And then, when you have a couple of thousand pods, at some point you hit DB connection errors on the database side, which limits how many clients can connect. You then see errors in Manila, of course, but it's not Manila itself, it's just the database backend.

So, in the end, what we have now is a kind of pre-production setup, still with these three controllers. We have three backends, all on Ceph, on different releases and for different purposes: a production cluster, where we have the HPC use case, whose share we have now created through Manila; a test backend, where users come and test whether Manila with CephFS works for them; and a dev cluster, where we are testing Luminous at the moment. This is basically what we have right now, and it seems to be working. The basic flow for provisioning such a share is sketched below.
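[Editorial sketch] For reference, provisioning and mounting a CephFS share through Manila looks roughly like this. The type and share names are hypothetical, the export path is a placeholder, and exact flags may differ between releases.

```python
import subprocess

def manila(*args):
    subprocess.run(["manila", *args], check=True)

# A share type whose driver does not handle share servers (as with CephFS).
manila("type-create", "cephfstype", "false")

# Create a 1 TB CephFS share and grant access to a cephx user 'alice'.
manila("create", "--share-type", "cephfstype", "--name", "hpc-share",
       "cephfs", "1000")
manila("access-allow", "hpc-share", "cephx", "alice")

# 'manila show hpc-share' then reports an export location like
#   mon1:6789,mon2:6789,mon3:6789:/volumes/_nogroup/<uuid>
EXPORT_PATH = "/volumes/_nogroup/<share-uuid>"   # placeholder from 'manila show'

# The user mounts that subtree with ceph-fuse, using the key for 'alice'.
subprocess.run(["ceph-fuse", "--id=alice", "-r", EXPORT_PATH,
                "/mnt/hpc-share"], check=True)
```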
Now, one of the first users we have on top of the HPC case is one of the experiments; they are actually using this. They have built an infrastructure that allows them to dynamically put together an analysis chain depending on the outcome of calculations they do and data they analyze. They do this as directed acyclic graphs, which are built at runtime, and the individual graph nodes are containers. I'm not really an expert in this, but in the end they use Magnum and Kubernetes to put it all together, and they use CephFS to store the intermediate stages of the jobs. So they have all these containers, and then they have this shared space where they store things. That's working very well — and it works very well partly because it leverages the existing integration of CephFS and Kubernetes, which is really nice. There was basically nothing for us to do, because it's all there (a sketch of that integration follows after my closing remarks).

So, to conclude on Manila: Manila is also awesome. It's working very well, the setup is really, really easy, and we didn't find any major issues. We did some functional testing — key generation, creation and deletion of everything — and that all worked OK. For some features we found missing, we talked to upstream: for example, quotas per share type, so that you can control that a tenant can create only so much, or take only so much space, on a specific backend, rather than having a global quota where you can't control where that space is actually used up. Some other features, like having truly multiple share services, are at least in the plans. We're also very interested in the CephFS NFS driver — something that re-exports CephFS as NFS — because some of our users would like NFS: it's something they know, and it's available everywhere, whereas otherwise there may be issues with licensing and warranties and so on. So that's something we're very interested in.

There are some issues that I think need a little more attention. For instance, the client integration: we try to move our users to the unified OpenStack client. We tell them, don't use the nova or cinder clients directly anymore, use the openstack client. But if you then have a new project that you're trying to get users onto, and you have to tell them, OK, for this project you have to use a container or special machines where this separate client is available — that gives the wrong impression of the state of the project, I think. So that would be really nice to have. There are some minor issues we found where you can do things on the CLI that you can't do in the UI, which may also be confusing for users. One thing that's maybe specific to CephFS is that with the last share that you delete, your auth ID also goes away. That can be easily fixed; I'm not sure how much users like it or not. And I really want to point out that the Manila team was really, really helpful in understanding how it all works and in adding the features that we needed — it's a really nice and helpful team, which is very much appreciated. And I'm basically done. Here's a look at the dashboard we have — this is the dashboard of the production Manila instance. It's still pretty small, and there's just artificial load at the moment, but this is basically how I check that everything is fine in Manila. I'm happy to take questions. Thank you.
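[Editorial sketch] On the CephFS–Kubernetes integration mentioned above: Kubernetes of that era shipped an in-tree cephfs volume plugin, so a pod could mount a share directly. A minimal sketch with hypothetical monitors, image, path, and secret:

```python
import subprocess

# A pod mounting a CephFS share via the in-tree cephfs volume plugin.
POD = """
apiVersion: v1
kind: Pod
metadata:
  name: analysis-step
spec:
  containers:
  - name: worker
    image: example/analysis:latest         # hypothetical analysis container
    volumeMounts:
    - name: shared-state
      mountPath: /shared
  volumes:
  - name: shared-state
    cephfs:
      monitors: ["mon1:6789", "mon2:6789", "mon3:6789"]
      path: /volumes/_nogroup/<share-uuid> # export path from 'manila show'
      user: alice
      secretRef:
        name: ceph-secret                  # holds alice's cephx key
"""

subprocess.run(["kubectl", "apply", "-f", "-"], input=POD, text=True, check=True)
```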
Q: The CephFS test cluster that you talked about at the beginning of your presentation — I'm assuming you used XFS as the FileStore backend, or did you venture out into BlueStore?
A: No, no, it's XFS.
Q: And did you do any testing with RDMA, or was it all standard TCP/IP?
A: Yeah, TCP/IP.
Q: So for the Luminous testing that you have planned, do you plan on testing the RDMA capability?
A: Actually, I'm not the Ceph expert, so in that respect I'm not really sure what he plans. I was chatting with him during Sage's talk about BlueStore and how we'd introduce it, and he's very interested in getting that into our cluster — but RDMA, I don't think so.
Q: OK, thank you. I'm very interested in NFS as a re-export for CephFS — so how do the clients connect at the moment?
A: At the moment it depends; as I said, there's a kernel client and a userland client. The Kubernetes clients that we have use the kernel client; everyone else we ask to use ceph-fuse to connect. We have a description of how to do this manually, and we also have something in Puppet that allows the user to just say, OK, this is the volume path — and we have a secret store where the user has to upload the CephFS key for his authentication identifier, from which the Puppet module pulls it. With that, plus the volume information the user gives, it's about four lines in the Puppet manifest for a user to integrate this, and that is also ceph-fuse. We're a little worried about the quota issue, really, because if you have no quotas and a client runs off at 3 a.m. and just fills up the cluster — yeah, that's something we're worried about. So this is why we ask users to use ceph-fuse, which actually respects the quotas.
Q: About the clients — do they have to talk directly to the OSDs for their data? That is, do you have to open up the OSDs to the clients?
A: Yes. With CephFS you have to trust your users to some extent: the clients are basically on the same network as the Ceph cluster, if that's the question.
Q: How important are mandatory quotas versus just having quota support in the kernel? In other words, the fact that you rely on the client to cooperate — you don't have malicious users, presumably; you might have users that make mistakes, but they're not going to swap out the ceph-fuse client, right? So I was curious about that slide: it seemed to me, offhand, that the issue of the FUSE client being cooperative was less important than having the kernel client respect quotas at all — but maybe I misunderstand.
A: Well, for us, having no quotas in the kernel client stops us a little from telling Kubernetes users to use CephFS, right? So that's a bit of an issue. Usually we don't have malicious users, as you say, so it shouldn't be a problem, and we also try to compartmentalize things so that if someone goes crazy here, it shouldn't affect everything else — for instance, with different Ceph instances or different pools and so on. But yeah, from an operational point of view, we would really like to have quotas. Any more questions? No? OK, thank you very much.