Good morning, everybody. Nice to see you here. My name is Enrico. I work at CERN in Geneva, Switzerland. We are a big science lab, operating the biggest particle accelerator in the world, and my talk today is about improving business continuity and disaster recovery for the Ceph infrastructure that we operate on site.

Before getting down to business: the picture that you see in the background represents the LHC, the Large Hadron Collider, our particle accelerator. The four names that you see around it are the four biggest experiments. They continuously collect data whenever there is a beam, and that data is used to reconstruct physics events and understand interesting things about how the universe actually started. This is a picture of the LHC tunnel itself. It was built 100 meters underground and has a circumference of 27 kilometers. The blue pipe that you see here is lined with superconducting magnets. We have two beams of particles which are accelerated through the superconducting magnets and then steered to collide at very precise places where we have the so-called detectors.

This is a picture of the CMS detector. In a super-simplified way, you can think of it as a gigantic camera. Of course, it takes very special pictures, and it's a bit more expensive and a bit heavier than the camera you may have with you today, but that's basically it. There are some very specialized detectors that measure different features and characteristics of the collisions, so that we can reconstruct via software what happened in the world of physics. And this is actually one of the first pictures that came out in July last year, when the accelerator was restarted after a long shutdown during which all the equipment and the detectors themselves were upgraded and improved.

Besides physics, we also do quite some IT at CERN. We have a pretty old data center located on the main campus, built, I guess, in the early 70s. And we are getting a new data center on a second site of our lab, which is being built at this very moment. It should be operational by the end of this year. It's a fully new building with three floors and a total capacity of 12 megawatts of power. We will not use all of that from the very beginning, but it is expected to cover the computing needs we will face during the next run of the accelerator, the High-Luminosity LHC, which has very high demands in terms of storage and computing power.

This is a table that shows roughly the Ceph clusters we operate at CERN. As you know, Ceph provides block, object, and file system storage, and we use all of them in different ways. We are not on Quincy yet; the main production version we use is Pacific, and we also have some clusters on Octopus. We are planning to upgrade everything to Pacific and Quincy by the end of this year. If you sum it up, the total size is roughly 60 to 65 petabytes, and with the new data center we are buying new hardware, so by next year we should be above the 100-petabyte bar.

We use Ceph for many different things. Just to make it clear from the very beginning: we are not using Ceph to store the actual physics data. That is stored on a different system, called EOS, which is roughly half an exabyte in size. But we use Ceph for general IT services, and the most important integration we have so far is the one with OpenStack.
OpenStack and Ceph team up very nicely, and together they build our computing infrastructure. This has been extremely valuable and practical for our users over the years, because they just go to OpenStack, spawn a VM, and get some block storage or a file system. Everything is fully integrated and delivered in a self-service manner, fully automated through OpenStack, which is extremely valuable for our user community (a sketch of this flow follows at the end of this section). We also use Ceph for many other storage needs across the laboratory. Within my group, which is the storage group in the IT department, we use it to deliver other storage services, for instance AFS or some NFS filers. And we see more and more use cases coming directly from the physics world to us; for various reasons they are particularly interested in S3 object storage. If you want to know more about how we use Ceph at CERN, there will actually be a talk by Dan, whom I see here in the first row, later today at 20 minutes past two, so you may be interested in learning how Ceph has been successful at CERN over the last 10 years.

So let's get down to business. The purpose of this talk is to walk you through the journey we took to test and validate the many different features, implemented both in OpenStack and in Ceph, that can improve our business continuity and disaster recovery. The goal is to collect evidence that can then drive decision-making; we have not taken any final decision yet, so I'm very happy to talk to you and discuss the technical aspects if you're facing similar issues. This is the outline of the talk: I will cover block storage, object storage, and file systems. It's actually likely that I will use up my full 30 minutes, and this is the last talk before the lunch break, so I'm also very happy to continue the discussion over lunch.

So, block storage. This was the first service that entered production at CERN, in 2013; I was not even working at CERN in those days. As always happens, you start with one cluster, then the storage demand grows, you scale it out horizontally, and you cycle the hardware through different generations. It has all worked very well. The aspect we are trying to improve now is mainly that, while Ceph is extremely reliable, a Ceph cluster by itself can still be a single point of failure. If you get corruption or data loss in some of the data structures of the Ceph control plane, it can severely impact the operation of your cluster. This is something we unfortunately experienced in 2020, when a compression bug in the LZ4 library corrupted the OSD maps and brought down all the OSDs for almost eight hours. Since then, we have been trying to reshape the way we do RBD. We have instantiated different clusters, moving from a single monolith where everything was stored to five different clusters, to reduce the impact that one cluster being unavailable has on our operations. I do not want to go too much into the details, but instantiating new Ceph clusters and provisioning additional storage is relatively easy, provided you buy the hardware you need. What is a bit more complicated is to expose the available storage to your users and to other service managers in a way that is consistent, so that they understand what they need and how their applications should be deployed.
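[Editor's note: a minimal sketch of what this self-service flow looks like from the user side. The image, flavor, and resource names are purely illustrative placeholders, not CERN's actual offerings.]

    # Illustrative sketch; all names are placeholders
    # Spawn a VM
    openstack server create --image alma9 --flavor m2.medium myvm

    # Create a 100 GB Ceph-backed volume and attach it to the VM
    openstack volume create --size 100 myvol
    openstack server add volume myvm myvol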
We weren't very successful at exposing storage consistently at the very beginning, because we were just instantiating a new cluster with new volume types, then a second cluster with again new volume types, and in the end this led to a proliferation of volume types; users were quite confused about what they were actually looking at. So we thought about it again and consolidated the volumes we offer according to their quality of service. More specifically, for the standard and io1 volume types we have introduced the concept of storage availability zones. Availability zones already exist for compute in OpenStack, and we are trying to mimic the same pattern for storage. When you ask for these volume types at CERN, you can land on one of three different Ceph RBD clusters which are fully decoupled from one another: they are physically placed in different rooms of the computing center, on different power feeds, UPSs, and network branches. If you don't care where your volume is going to be stored, you delegate the choice to Cinder, which has internal weighting functions to decide. Otherwise, if you know how you want to deploy your application and that a volume has to go to one specific availability zone, you can specify the availability zone you want to use, as in the first sketch below. This has helped us reduce the impact of one cluster being unavailable.

The second thing we are looking into is backing up RBD images, or having some sort of active-passive standby. Ceph provides this feature: it's called RBD mirroring, and there are two modes of operation, one based on journaling and one based on snapshots. We started with journaling because it is fully supported by OpenStack out of the box. Unfortunately, it exposes the RBD clients to the so-called double-write penalty, which has quite an impact on performance. In these two plots you can see the bandwidth and the number of IOPS you are able to achieve with different block sizes, that is, small writes and big writes. The blue bar is without the journal, so full throttle, full performance; the red and orange bars are with journaling, with the journals stored either on spinning disks or on SSDs. Small writes are dramatically impacted when the journals are on HDDs; with the journals on SSDs they are not impacted quite as badly, but they still suffer, and big writes suffer too. We considered this too high a price to pay, so we looked into the other mode of operation for mirroring, which is based on snapshots (see the second sketch below). This works very nicely out of the box: the performance impact is very limited and only related to snapshot trimming, so it is fully contained in the back end, and the client is not exposed to the complexity of this mode of operation. Here we were primarily looking into pool-based replication, so it's something we are keeping in our pocket in case we have to put in place a purely Ceph-based disaster recovery mechanism. But what we also want to offer is a backup and restore service that users can manage by themselves, and we are trying to do that with the OpenStack Cinder backup drivers.
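[Editor's note: the slide example referenced above is not reproduced in the transcript; a minimal sketch of the two options, with a hypothetical zone name, looks roughly like this.]

    # Illustrative sketch; volume type and zone name are placeholders
    # Let Cinder pick a storage availability zone via its weighting functions
    openstack volume create --size 100 --type standard vol1

    # Or pin the volume explicitly to one storage availability zone
    openstack volume create --size 100 --type standard \
        --availability-zone ceph-room-a vol2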
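[Editor's note: for reference, a sketch of how snapshot-based mirroring is switched on with the stock RBD tooling. The pool, image, and interval are made up, and this assumes the peers have already been bootstrapped and an rbd-mirror daemon runs on the target cluster.]

    # Illustrative sketch; pool/image names and interval are placeholders
    # On both clusters: enable mirroring on the pool, per-image mode
    rbd mirror pool enable volumes image

    # Enable snapshot-based mirroring for one image
    rbd mirror image enable volumes/vol-0001 snapshot

    # Create mirror snapshots automatically every 30 minutes
    rbd mirror snapshot schedule add --pool volumes 30m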
Coming back to the backup drivers: we have looked into two of them, one using S3 as a back end and one using Ceph RBD, because we operate both technologies and are familiar with them. I don't want to go too much into the details. This is a performance measurement of the time it takes to back up images that are completely empty, 50% used, or fully used when using the S3 driver. The implementation is quite inefficient: under the hood, the driver reads the source image in chunks, compresses the chunks, and uploads them to S3, all sequentially, so it takes quite a bit of time, six or seven minutes to back up a five-gig image. There are some parameters you can tune, but none of them is really a game changer, and in any case you have to think about the trade-offs: you have to decide whether you want your backups to be fast or to be efficient in terms of storage, compression, and deduplication.

What works very well is the backup driver that uses another Ceph RBD cluster as a target, because the driver uses low-level Ceph features and libraries such as librbd. The performance is very good out of the box: you are able to do backups at 140 to 150 megabytes per second, and the same holds for restores. The speed is consistent throughout the whole transfer. As you can see here, we are backing up images from one gig to one terabyte, and the time it takes is proportional to the size of the image. What is also very important for us is that this scales very well horizontally with the number of concurrent backups, so you are actually limited by the network bandwidth you have, not by CPU or memory. Incrementals are also very efficient. Again, this uses Ceph low-level features, so the time it takes to do an incremental backup, when you change a known amount of data in the source image, is related to the percentage of data you changed. It doesn't precisely match the number of bytes you modified, but as you can see there is a clear trend, consistent with the changes applied.

Okay, let's move on to object storage. Here we mainly operate two clusters at the moment. The first one is the main production cluster; its data pool is stored with erasure coding 4+2, while the bucket indexes are still stored in replicated mode. We have a number of RADOS Gateways behind a Traefik front end, which does nice things for us including TLS termination and health checks against the RADOS Gateways. What is also very important for us is that we can do pattern matching on bucket names, so we route some privileged services from Traefik to dedicated RADOS Gateways. This is, for instance, the case for GitLab, for the code repositories and container images, or CVMFS, another service we operate for the distribution of software at a large scale. We also have a second S3 cluster which is physically located in another place. The two clusters are completely decoupled from each other: this one is not a second region or zone, not yet at least. It holds 25 petabytes of storage, which today is mainly used for backups.

One thing we have tested in the context of object storage is the multi-site configuration, starting with the traditional full-site replication between a master zone and a secondary zone.
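[Editor's note: a rough sketch of what such a two-zone setup involves. The realm, zonegroup, zone names, endpoints, and elided credentials are invented placeholders; see the Ceph multi-site documentation for the full procedure, including the system user.]

    # Illustrative sketch; names, endpoints, and keys are placeholders
    # On the master: create realm, zonegroup, and master zone
    radosgw-admin realm create --rgw-realm=cern --default
    radosgw-admin zonegroup create --rgw-zonegroup=geneva \
        --endpoints=https://s3-a.example.ch --master --default
    radosgw-admin zone create --rgw-zonegroup=geneva --rgw-zone=zone-a \
        --endpoints=https://s3-a.example.ch --master --default
    radosgw-admin period update --commit

    # On the secondary: pull the realm and create the secondary zone
    radosgw-admin realm pull --url=https://s3-a.example.ch \
        --access-key=... --secret=...
    radosgw-admin zone create --rgw-zonegroup=geneva --rgw-zone=zone-b \
        --endpoints=https://s3-b.example.ch --access-key=... --secret=...
    radosgw-admin period update --commit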
We tested this on Quincy and benchmarked it with a tool called warp, by MinIO. The tool itself is not specific to multi-site deployments, but it's quite useful if you want to hammer an S3 instance and see what happens. There are some pain points. The major concern at the moment is that synchronization between the main zone and the secondary zone may lag behind and struggle to recover. One day we deliberately shut off the secondary cluster and wrote one million objects into the primary; it took quite a long time to reconcile and replay all the changes on the secondary zone. Bucket index resharding is also quite stressful: the main zone happily reshards, but the secondary zone then gets somehow disconnected, so if you try to list the contents of your buckets it doesn't work out of the box. There are manual procedures you can apply to bring it back, but it's not something I would do happily or regularly on a production multi-site deployment. The last point is that multi-site replication in the RADOS Gateway is implemented as a sort of FIFO queue, which means there is a replication delay between the main zone and the secondary zone, and the delay is non-negligible: it can be up to tens of seconds, as you can see from the plot here, where the blue line shows the propagation delay of deletes and the yellow line the propagation delay of puts. So you still need a front end, as we already operate with Traefik, to make sure your clients are always directed to the right cluster. Otherwise you may end up in situations where a client does a PUT and immediately afterwards does a GET or HEAD that gets redirected to the secondary zone; the object is not there yet, the client gets a 404, and this is extremely confusing. It's pretty easy to handle this with HAProxy, with Traefik, and I'm pretty sure with many other front ends. It's not a Ceph problem in a strict sense, but it's something to be taken into account, especially when you have to fail over and the main cluster later comes back: your front end would say, okay, the main cluster is back, but if in the meantime you have been writing objects into the secondary zone, those are not yet replicated back to the main cluster.

We are also looking into S3 for some disaster recovery ideas we have in mind. Some features are super valuable for us, for instance immutable objects: you may want to be able to write to S3 and never be able to delete an object again, or only delete it after a given amount of time. You can do this by combining versioning and object locks; there is very good documentation about this online, and a sketch follows below. Another feature we are very interested in is the so-called archive zone: you can set up a zone in your zone group that works as an archive and takes responsibility for storing all the object versions that exist on an S3 cluster. It is mainly meant to keep previous versions on cheaper media; for instance, if you run your main S3 region on SSDs, the versions can happily go to HDDs. We are running the main region on HDDs at the moment, so it would not make a huge difference for us today, but it's definitely valuable and we may want to use it in the future.
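[Editor's note: a sketch of the versioning-plus-object-lock combination with a generic S3 client. The endpoint, bucket name, and retention period are invented, and object lock support depends on the RGW release in use.]

    # Illustrative sketch; endpoint, bucket, and retention are placeholders
    # Create a bucket with object lock enabled (this also enables versioning)
    aws --endpoint-url https://s3.example.ch s3api create-bucket \
        --bucket immutable-backups --object-lock-enabled-for-bucket

    # Default retention: object versions cannot be deleted for 30 days
    aws --endpoint-url https://s3.example.ch s3api put-object-lock-configuration \
        --bucket immutable-backups \
        --object-lock-configuration '{"ObjectLockEnabled": "Enabled", "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}}'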
Last topic, file systems. We use CephFS too. Our main CephFS users are container-based workloads running either on Kubernetes or on OpenShift. Over the years we have tried to apply an approach similar to what we did with the RBD clusters: we had one big CephFS cluster, which is still there, but in the meantime we have created several others, one fully on flash and some dedicated CephFS clusters for custom use cases such as HPC. We do use multiple active MDSs, and in some cases we explicitly pin some directories to one MDS, because we know that a given user is very demanding and we therefore allocate pretty much dedicated resources to that use case.

What we do not use at the moment is snapshots, even though we would really love to. So what are snapshots? They are precisely what you think: an immutable view of a file system at a given point in time. They are extremely easy to use on Ceph: you enable them globally on your CephFS cluster, and then you create and delete snapshots by creating and deleting directories inside a special folder called .snap (see the sketch at the end of this section). This is something the administrator can do, but the end user can do it too, so it would enable several options; for instance, you could retrieve previous versions of files, or files that were deleted, from the very same mount. A snapshot is a point-in-time consistent view of the file system, so it is very valuable if you are doing backups, maybe using external tools or maybe even using Ceph itself: there is a CephFS mirroring capability, with a mirror daemon responsible for copying CephFS data from a source cluster to a target cluster in a very efficient way, and all of it is based on snapshots.

Unfortunately, snapshots do not come for free. There is quite a performance impact, and it becomes visible if you have very demanding metadata workloads. In this slide you see two cumulative distribution functions of the time it takes to untar the Linux kernel, the traditional tarball we always use for this kind of test, and then to remove all the files that were extracted. We repeated the test, I think, 100 or 150 times. I'm not sure the colors are very visible: there is a green line on the left-hand side of the plot, which is the time it takes when snapshots are not enabled, and a red line more to the right-hand side, which is the time it takes when snapshots are enabled. The gap between the two represents the performance penalty you pay if you use snapshots. The problem seems to be confined to the metadata server, and it is also known upstream: it seems that the MDS spins on unlink operations for several cycles. We are considering whether it is possible to work around this, and we think it is. For instance, in a multi-active MDS setup you can pin the directories that need snapshots to one MDS, so those are the ones that "suffer", while all the others go to other MDSs. If you are using CephFS as a scratch space, for example, you do not need snapshots, and those workloads can take advantage of the full performance that Ceph is able to deliver.
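[Editor's note: as referenced above, the whole user-facing snapshot interface is a directory convention. The file system name, mount point, and snapshot name are made up; the enable step requires administrator privileges.]

    # Illustrative sketch; fs name, paths, and snapshot name are placeholders
    # One-time, as administrator: allow snapshots on the file system
    ceph fs set cephfs allow_new_snaps true

    # Create a point-in-time snapshot of a directory
    mkdir /mnt/cephfs/mydata/.snap/before-upgrade

    # List and remove snapshots the same way
    ls /mnt/cephfs/mydata/.snap
    rmdir /mnt/cephfs/mydata/.snap/before-upgrade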
The pinning workaround I just described still has to be fully understood before we deploy it in production, and in any case it only helps if not everyone on the cluster wants snapshots, because otherwise everyone will suffer at the end of the day. So, while we would like to use snapshots and the mirroring functionality, we are not confident enough at the moment to enable them, and we are looking into other strategies for CephFS backups. We are developing a tool in-house called cback, which stands for CERN backup. It is built on top of Restic, so it is not written from scratch (the underlying Restic flow is sketched at the end of this section). The main difference is that in cback you have stateless agents, responsible for doing backups and restores, which you can scale horizontally as much as you want. The source of truth is a centralized MySQL database where every agent reads or updates the state of a given backup. We already use this quite extensively: we do 40,000 backups a day and have roughly six petabytes of data backed up through this machinery onto S3.

We would also like to do CephFS backups via OpenStack, that is, via Manila, because it would be very good if we could offer end users something similar to what we are trying to achieve for RBD with Cinder. Manila currently implements a share replication functionality, which is mostly meant for high availability. There is a backup driver for Manila being discussed; I understand it is mostly based on rsync, because you are expected to have access to both the source file system and the destination of your backup and to copy all the files over. There may be more efficient ways to do this, for instance using the CephFS mirroring functionality or, if possible for our use case, integrating cback into it. We are developing APIs around cback, so Manila could potentially query the state of a backup by making a call to cback.

So, wrapping up: we have explored many things, and there are many things one can do with OpenStack and especially with Ceph. The complexity for us mostly comes from the fact that business continuity, in a very simplified way, maps to high availability, while disaster recovery, again in a very simplified way, means backup and restore, so the functionalities you may want to use are quite different. What is also quite different is that, although Ceph is three things in one, it is not true that there is one feature you can use for all three types of storage: you have to specialize for block, for objects, and for file systems. Spreading storage over multiple clusters does help in reducing the impact of a single cluster going down, and there are valuable features for active-active and active-passive setups. What we have learned from experience is that it is essential to go and talk to our users, at least the main ones, to make sure they use the storage we provide in the best possible way and design their services to be reliable given the storage we provide.

One question you may have is: if business continuity is so critical for you, why don't you just use stretch clusters? We are testing stretch clusters; it is definitely something we are interested in, and they seem to work extremely well. We have a testbed with Quincy running.
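[Editor's note: cback itself is not shown in the talk, but since it builds directly on Restic, the underlying flow it automates looks roughly like this. The repository URL and paths are placeholders.]

    # Illustrative sketch; repository URL and paths are placeholders
    # Initialize an encrypted Restic repository on S3 (one-time)
    restic -r s3:https://s3.example.ch/cback-repo init

    # Agents then run incremental, deduplicated backups of a CephFS share
    restic -r s3:https://s3.example.ch/cback-repo backup /cephfs/shares/myshare

    # Restores pull a snapshot back out
    restic -r s3:https://s3.example.ch/cback-repo restore latest --target /restore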
With stretch clusters, though, we have a bit of a problem integrating with OpenStack, specifically with the Cinder and Manila controllers, because, as I said at the very beginning, we will have two data centers, and the controllers are expected to be fully isolated from each other. They are actually expected to be two different regions in OpenStack, because we plan to cut the crosstalk between the two data centers as much as we can, precisely for business continuity reasons. So the problem is this: Ceph is able to operate this way, and you may have block and file system storage, or even object storage, which is consistent across the two data centers, but we then have to work around it on the OpenStack side, to make sure the OpenStack infrastructure can leverage this and use the storage that exists in the surviving data center in the event of an outage.

As for disaster recovery, we are struggling a bit, to be honest, to do proper backups, because, as I have already mentioned, we started looking into backups separately for block, objects, and file systems. Is there another way of doing this? Can we back up a Ceph cluster in its entirety? Maybe that would be interesting to explore. But even if you just think about the three storage types provided, there are quite a number of different ways to build around them and make this functionality available. Okay, I think this is it. Thank you very much. I'm very happy to take questions.

We have a question. Yes? Okay, the question is: how do we do the MDS pinning? The answer is that, in practical terms, we know that some users are going to be very demanding, and they use CephFS by natively mounting it on bare-metal machines or VMs that they own, using either the FUSE client or the kernel mount. These are other colleagues in IT who run services on top of Ceph and OpenStack, so we basically know them, we talk to them, and we can coordinate this. What really simplifies our life is that in those cases it is not a Manila share that is provisioned on the spot and then deleted: they get a share through Manila, but it is expected to stay there for a very long time, so we don't have to chase them or track them, because they are not constantly creating and deleting file systems and subvolumes. (The pinning itself is sketched below.)
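[Editor's note: the pinning mentioned in the answer is just an extended attribute on the directory. The MDS rank and path below are examples.]

    # Illustrative sketch; rank and path are placeholders
    # Pin this subtree to MDS rank 1
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/volumes/heavy-user

    # A value of -1 removes the pin and hands the subtree back to the balancer
    setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/volumes/heavy-user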
Every now and then we do scans of what is going on on the CephFS clusters, and if possible we try to redistribute the load evenly. In the past, Dan was actually trying to use ephemeral pinning, it's actually on this slide. It didn't work extremely well: you identify the user that keeps an MDS very busy, that user gets moved to the least busy MDS, but suddenly that MDS turns out to be the most busy one, because of the new user that has landed on it. So what was happening is that you keep moving that user around, which is quite an expensive operation. This is something we have to re-evaluate, because there have been many improvements in this area; at the moment we are not using this functionality. (A sketch of how ephemeral pinning is set follows below.)

Yes? Okay, so the question is about the latency between the two data centers. The latency is very low; they are a few kilometers apart, and the network is one big network that includes both data centers. The second data center is not there yet, so I don't have the numbers, but the latency is expected to be extremely small: below one millisecond, or in the worst case in the order of a few milliseconds. So the impact is very small.

Another question. Okay, it's actually hard to tell whether the snapshot penalty degrades further; I think that if the usage increases, it becomes more evident. The main problem is that the whole MDS gets busier overall, so it is not only the shares and subvolumes with snapshots that suffer from this problem; it is everyone using that MDS. So yes, if your cluster is very big and pretty loaded, this becomes more evident and clear to you. If your cluster is relatively small, or your clients are not hammering it with heavy metadata workloads, it should not be a huge problem; it may even stay below the noise threshold, and you would not even think you have a problem. It mainly depends on the activity on your cluster. We have something like 15 to 20 thousand requests per second, so if that were throttled, it would be pretty clear from the very beginning that there is a problem and a general slowdown, and we do not see that: we have snapshots enabled on one cluster which is relatively recent, so not heavily used, and the functionality just works perfectly. We are just a bit scared to push the button and enable snapshots on the main production cluster, because, as I was mentioning, it serves some tens of thousands of operations, and once snapshots are enabled and someone takes them, it is not easy to roll back and recover from that situation.

Yes, absolutely. We also have dedicated MDS servers with quite some memory, I think in the ballpark of 192 to 256 gigs, but you would still see this problem. It is not a cache problem, nor a problem of memory you cannot allocate; it is a problem of how much time you spend in that loop, and the MDS is pretty much single-threaded, so there is no obvious way to easily offload work somewhere else. We have four plus four: four active MDSs and four standbys, and the standbys are simple standbys, they do not do standby-replay. We have another cluster, for the HPC users, that does standby-replay; that one has just one active MDS and one standby-replay, and it also works pretty well.
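[Editor's note: as promised above, a sketch of the ephemeral pinning knob. The path is an example, and this is the distributed variant available in recent Ceph releases.]

    # Illustrative sketch; path is a placeholder
    # Distributed ephemeral pinning: spread the immediate children of this
    # directory across all active MDS ranks via a consistent hash
    setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes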
Any other questions? Okay, we go for lunch. Thanks a lot.