My name is Jose Castro Leon. I'm responsible for the private cloud solution that we have at CERN, and this is going to be the last talk that the CERN folks from the cloud team are giving here. We are going to talk about storage in the CERN private cloud. If you are wondering what this piece of equipment is: it is a cryostat that was lowered down into the tunnel during the Long Shutdown, and it is going to be used in the High Luminosity LHC, the version of the machine that we are going to have later on.

What we are going to do is basically introduce a bit of what we do at CERN and what we do in the CERN cloud service, and then focus on the storage area: all the design choices that we made, the great things that happened, and also the pitfalls of the decisions we made, all the things we have experienced over time and how we try to address them in the future.

CERN is the European Organization for Nuclear Research, the world's largest particle physics lab. It was founded by 12 member states in 1954; now we have collaborators all over the world, in 23 member states and many associate countries. What we do is fundamental research in physics. We have many, many experiments; the best known are the four experiments on the Large Hadron Collider, but we have many more, which you can see on the right-hand side of the slide. You inject hydrogen into the setup, ionize it, and then pass it through the whole chain of accelerators in the complex, from the linear accelerator to the booster, the Proton Synchrotron and the SPS, the Super Proton Synchrotron, until the beam is close to the speed of light in the LHC, where we collide those particles in one of the four main experiments. We also have more experiments further down the chain; probably the best known is the Antiproton Decelerator that appeared in "Angels and Demons".

To give you a bit of an idea of the size of the complex: we have two sites, one in Meyrin and the other in Prévessin, located on the border between France and Switzerland, and the LHC has a circumference of 27 kilometres. This is just to give you the scale of the largest machine mankind has ever built.

To support the collisions, analysis and simulation that the physicists are doing with this machine, the cloud service was built in July 2013 by a team of engineers, and we are focused on providing resources for the whole organization, for physics and also for non-physics services. What we have is a fleet of hypervisors based on CentOS 7, which we will hopefully move to CentOS Stream 8 soon, in a single data centre located in Geneva; we are going to add another data centre at the other site, in Prévessin, soon, but I will come back to that later. The cloud is deployed in several regions that are not fully independent, in a highly scalable architecture with 48 cells across the site. Maybe you are surprised that we are running the Stein release, but this is because we have been carrying the CERN cloud forward since we started in 2013, upgrading in place, although some of the services have already moved to later releases.

To give you an overview of the stats, we have about 9,000 servers under management. If you look at the changes we made in the last couple of years, we moved the fleet that was taking care of the batch compute farm from virtual machines to bare metal, and this is why there was a drop in the number of hypervisors that we are handling and in the number of users that we have right now.
It's basically all the services that are hosted in the organization: websites, applications, the design of microelectronic devices, and so on and so forth. That is more or less the overall set of resources we have.

These are the services that we run in production currently. On the upper part we have all the services that are well known to you in terms of OpenStack, or OpenInfra, but then we need to build some code, some glue, to be able to offer them in production. I'm talking about aggregating the data that we collect from all the hypervisors to monitor them, automation tooling, probes that test that the infrastructure actually works at scale over a long period, and other integrations that we need to adapt to our environment.

What I'm going to talk about is mainly the three components, Cinder, Manila and Glance, that make up the storage area in the CERN cloud. We are going to start from July 2013 and go over the timeline of almost the last 10 years, and look at the pain points, the decisions we made, and what the takeaways are, so that you don't make the same mistakes as us, let's say, or you can benefit from all the learning experience that we had.

So, at the beginning we started just with Glance.
It was pretty simple, we just needed that. We were also looking at adding volumes; at that time it was called nova-volume, but then it was split out into Cinder as a separate project, and we just jumped on and added it right away.

For support reasons we introduced two technology stacks: mainly KVM and Ceph for Linux VMs, and Hyper-V with NetApp for having both VMs and volumes for Windows. What turned out to be extremely difficult, after some upgrades that we did on both setups, was having a small team that needed to maintain these two different technology stacks and upgrade both of them. Just to add a fun fact: we did not have root access on the NetApp filer, and every time we connected to it and operated on it, we realized that the ACLs for the driver had been changed, and then we needed to look at it all over again; it was failing every single time. If you have root access, that's great, but we didn't have it, so it was a big pain on our side.

So the first thing we did was say, OK, that was a bad decision at the time; let's focus on consolidating everything on Ceph. We did a first investigation of running Windows on KVM, and that was OK. The main reason for having the second technology stack with Hyper-V had been support, to get support from the vendor: it was nice because you could send the whole support package, in this case to Microsoft.
It's like, "this doesn't work, please have a look". But we never actually used it. So we decided to launch a campaign to recreate all those machines as boot-from-volume machines on KVM. It was accepted by the users at that time, as a last-resort measure to simplify the setup and make it easier to run and more scalable. What we also ended up doing was retyping all the NetApp volumes, because we were having quite some pain managing them.

But then we hit the first issue. When you have hardware in production, after some time it gets out of warranty and you need to replace it. That time came and we needed to change the Ceph monitors, and the monitors have IPs. If you look at how the block device connection is calculated in Cinder when you attach a volume, those IPs are persisted in the Nova cell databases. So if you change those IPs when you replace all the monitors, you hit this bug that we reported, I think, something like seven years ago. The funny thing, though, is that the client is able to connect to the new monitors: the protocol is able to negotiate, and if there are more monitors the client will switch to them on the fly. The real problem underneath is that if you restart the boxes, they will rely on what is in the database, which is wrong, and those machines will never be able to start. So what we ended up doing was writing a script that allowed us to go through the databases and change those IPs that are persisted there, on all the Nova cells we have and also on the Cinder database. If you want to have a look, it is in this repo over there.

The second issue was triggered by an alarm raised on the setup saying that a backend was offline, and we were completely surprised: why is this backend busy zeroing blocks on a volume? When we analyzed it, it was actually a user with a volume that had many allocated blocks who triggered the deletion of that volume. On the backend there is only a single thread: it will go through the driver, get to the volume, go through all the blocks, and zero every single block this volume has, preventing any other operation on the backend. So you have a single user producing a denial of service on your nice and highly reliable backend. That is something we spotted, and what we did was contribute an offloaded deletion to the community, which is basically just having a second thread: the main thread moves those objects into a trash area, and then the helper thread catches up later, goes to those volumes and zeroes those blocks. To check that the helper thread never falls too far behind, we monitor the trash length of every single backend we have, and you can see in the graph that on the 25th of May we hit a volume that was stuck deleting, but the helper was able to catch up, so the alarm never fired. I don't have it in the slides, but we actually had the same thing on Manila, and thanks to the Manila contributors, who changed the CephFS driver that we use, this is no longer the case on Manila; it is working perfectly fine.

Some time later we added S3, the last service added into the setup. The Ceph team offered us a RADOS Gateway API to connect to our cloud. Just to let you know, this S3 gateway was already used in production by the two big experiments, ATLAS and CMS; it already had users. It's not like when you start a cloud and deploy a RADOS Gateway that is exclusively for you; we connected to one that was already available, so it already had other keys.
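To give a concrete idea of what that monitor-IP fix-up has to do, here is a minimal, hypothetical sketch (not our actual script, which is in the repo mentioned above): it rewrites the monitor addresses inside a persisted `connection_info` blob. The JSON layout shown is an illustrative RBD-style structure; the real database rows and field names may differ.

```python
import json

def remap_mon_hosts(connection_info_json, ip_map):
    """Replace retired Ceph monitor IPs with their replacements
    inside a persisted connection_info blob. IPs not present in
    ip_map are left untouched."""
    info = json.loads(connection_info_json)
    data = info.get("data", {})
    if "hosts" in data:
        data["hosts"] = [ip_map.get(h, h) for h in data["hosts"]]
    return json.dumps(info)

# Example row: one monitor being retired, one staying as-is.
old = json.dumps({
    "driver_volume_type": "rbd",
    "data": {"name": "volumes/volume-42",
             "hosts": ["10.0.0.1", "10.0.0.2"],
             "ports": ["6789", "6789"]},
})
new = remap_mon_hosts(old, {"10.0.0.1": "10.1.0.1"})
```

In practice a script like this has to run over every Nova cell database and the Cinder database, with everything stopped, before the old monitors are decommissioned.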
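The offloaded-deletion idea can be illustrated with a generic producer/consumer sketch; this shows the pattern, not the actual Cinder/RBD driver code. The caller only moves the volume into a trash queue, and a helper thread does the slow zeroing work in the background.

```python
import queue
import threading

trash = queue.Queue()
deleted = []

def delete_volume(name):
    # Fast path: just move the volume into the trash. The caller
    # returns immediately and the backend stays responsive.
    trash.put(name)

def _zero_blocks(name):
    # Stand-in for the slow per-block cleanup on the real backend.
    return f"zeroed {name}"

def trash_worker():
    # Slow path: drain the trash in the background, one volume
    # at a time, so deletions never block other operations.
    while True:
        name = trash.get()
        if name is None:       # shutdown sentinel
            break
        deleted.append(_zero_blocks(name))

t = threading.Thread(target=trash_worker)
t.start()
delete_volume("volume-1")
delete_volume("volume-2")
trash.put(None)
t.join()
```

The trash length (`trash.qsize()` here) is exactly the quantity we monitor per backend to make sure the helper is keeping up.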
So, OK, let's give it a try. We configured it, and then you start to see that all the operations done on that S3 server were delayed, because all of them were trying to validate the EC2 keys against our Keystone server. That was adding considerable latency to every HTTP call the S3 service was receiving, so we had to turn it off, because it was affecting the two big experiments. We ended up writing another script, which again I can point you to, that synchronizes the keys between what we have in the cloud and the RADOS Gateway accounts, at a 15-minute interval. It is easily deployable and it works for us. Some time later, after we reported this to the Ceph developers, they realized that the credential lookup order needed to change for these cases, so they made it configurable. Now we are looking at changing it again and checking that it works properly.

The thing is, once you have added this kind of service, there is no quota support for it embedded in OpenStack, so we had to build some tooling to offer the same kind of quality of service as any other OpenStack product. You may need to do the same if you add it to your setup.

So, let's say we have all the services deployed. The first thing that comes to your mind is: what happens if, during the night, something burns down and you need to keep the APIs up? From the service perspective, relying on a single backend doesn't fly. If you look at the services we have deployed, Glance and the S3 gateway are horizontally scalable and have been highly available since the beginning. In Cinder we added HA at all levels, with a coordination cluster; we are using etcd for that, and all the backends we have are configured in cluster mode. So they are all up, and we currently have five, so we would need to lose another four before you would start to get some noise in the APIs. From the Manila perspective, we have the API and the scheduler in active-active mode, but we still have an issue with the client, because it has exclusive access to the CephFS cluster, and that prevents another client from connecting to the same CephFS cluster. It's something we need to look at; or maybe it has been fixed already, even though we are running a fairly recent release.

Then there is something that hit us hard. When we started to deploy the Ceph clusters, they were very, very reliable, and that's a problem, actually. They were so reliable that we ended up creating one volume type per QoS setting and per backend, with no real scheduling: we said, it's fully scalable, fully reliable, we go for that. The moment you start to add more QoS settings, higher throughput, encryption, you get more and more volume types. Just try to think about it: if you offer a user a quota request form with 20 different fields, volume types with different QoS, this one with encryption, this one without, this one with more IOPS, this one with less, in the end they cannot understand the difference between them. They will just say, give me the standard one and I'll take care of the rest. So having this extreme number of volume types providing different features means most of them are no longer used, and you end up with unbalanced use of the backends.
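To make that volume-type explosion concrete, here is a small sketch with made-up names: if every combination of QoS level, encryption option and backend becomes its own volume type, the list the user faces grows multiplicatively.

```python
from itertools import product

# Illustrative QoS levels, encryption options and backends;
# these are not our exact type names.
qos = ["standard", "io-high", "io-low", "throughput"]
encryption = ["plain", "encrypted"]
backends = ["backend-a", "backend-b", "backend-c"]

# Every combination becomes a separate volume type the user
# has to understand and request quota for.
volume_types = [f"{q}-{e}-{b}"
                for q, e, b in product(qos, encryption, backends)]
print(len(volume_types))  # 4 QoS x 2 encryption x 3 backends = 24 types
```

Adding a single new QoS level or feature flag multiplies the whole list again, which is exactly why users stop distinguishing between the options.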
Around the same time, we got hit by a bug on one of the Ceph clusters that actually tore down the whole cluster, one that had been very reliable and had a long uptime. So we said, let's have a look and bring back availability zones properly, which we didn't have from the start. We were adding more Ceph backends to the Cinder setup, and we thought, OK, let's see if we can make the change in Cinder to go from a non-AZ setup to an AZ setup. It is not supported anywhere, but if you look at the Cinder database schema, it's just changing a field in the database: you stop pretty much everything, you change the field, and that should be easy, right? So we did an intervention to change that, and we enabled three availability zones on the main volume type, the standard one you see over there.

What you would expect just after enabling this is that requests are equally distributed across the clusters. That is what we were expecting too, but actually they kept landing on the older one, which we were trying to evacuate a bit. The reason is that those clusters do not have the same size: the old one was much bigger than the other two, which we had built on the fly to have all the availability zones for storage. And there is a setting called max_over_subscription_ratio that was hitting us badly: it effectively made the big cluster the only one being chosen. So what we ended up doing was tweaking those ratios to make the backends look similar, and after that the volume requests started hitting the other two as well. The idea later on is to remove that tweak once they are equally balanced.

As a consequence of this introduction of availability zones, we added another cluster in the critical area. There was a cluster that was old and needed to be removed, and we needed to evacuate the clients towards this new cluster to replace the existing capacity. The scheduling change was easy, we just made it, so new requests choose the new cluster. Existing customers on the old availability zone on the same volume type just need to retype, and since the AZ will be enforced later on, the idea is that users can do the retype operation by themselves. The only concern is that you cannot have any snapshots, because those cannot be moved. The idea is to use the migrate command, the migration path in Cinder, although we have one issue: the machines that were created boot-from-volume, because those volumes cannot be detached. For those, what we are asking users to do is just ping us, create a ticket for us,
so an admin can intervene, detach those volumes and proceed with the transfer.

As takeaways from this: if you are doing a conversion of any image, and this happened to us at the very beginning when we moved from Hyper-V to KVM, because the format used on Hyper-V and the format sent down to RBD on Ceph were not the same, there is a conversion that needs to happen on the Cinder controller. So you need to have space, or additional storage, on the Cinder controller if you want to do that; if not, you will get a very nice out-of-space alarm when one of the Cinder controllers just pops off. And if you are doing a similar operation to the volume migration we are doing now, using the default migration path provided in Cinder, have a look at the block size setting. This setting establishes the amount of data that you read from the source backend and write to the destination backend per operation, so depending on its size you may get more throughput; if you tune it properly, the transfer will be much faster.

Something that happened a long time ago: some users deleted images that had been used for certain experiment workloads, and those images needed to come back, just to execute the workloads again and validate that the results of a new method of detecting particles were compatible with the results obtained in the past. So we built some archiving of deleted images, which we keep for longer periods, so that those users can come back to us and say, please restore this image for us.

Now, jumping to the things we are currently looking at. The usage of Manila is enormous now. We have plenty of Kubernetes clients, and those are using Manila CSI to create shares on our setup; the usage of Manila is exploding. It is now so popular that the clusters are getting more and more shares, and something we have been hitting is that the start-up time of the process that handles those shares is taking something like 30 minutes, just to start, because it needs to calculate the export locations for every single share hosted on the backend at boot time. This is something we need to look at and address, but you only see it when you reach a certain size of setup.

The other thing we are looking at is the performance of RBD for Windows on these boot-from-volume VMs. This could well be our fault, but it is also about the way Windows does IO at boot time and during upgrades: it does IO in a different way than Linux does. What we have been seeing is non-aligned IO hitting the RBD cluster, so the IOPS limits we have are hit much earlier, and the performance the users get is pretty poor. We applied some workarounds to define the physical and logical block sizes for those devices on the VMs, but we still have this performance problem, and it is something we may need to think about; maybe we will change the QoS settings we deploy.

All these decisions need to be thought through. When the CERN cloud was built, it was constructed in a data centre that was not designed for hosting IT services the way we do now, so some considerations, like for example availability zones, were not taken into account from the start. And if you look at the expectations for the upcoming years on the accelerator, mainly the next upgrade of the LHC, we will need significantly more capacity.
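Coming back to the max_over_subscription_ratio behaviour from a few minutes ago: the effect can be reproduced with a toy model of capacity-based scheduling. This is a deliberate simplification (the real Cinder filter and weigher logic has more terms, reserved percentages and so on), and the numbers are invented, but it shows why the big cluster kept winning until we tuned the ratios.

```python
def virtual_free_gb(total_gb, provisioned_gb, ratio):
    # Toy model of thin-provisioned capacity accounting: the
    # scheduler treats total * ratio as the usable ceiling.
    return total_gb * ratio - provisioned_gb

def pick_backend(backends):
    # Toy capacity weigher: the backend with the most virtual
    # free capacity wins every new request.
    return max(backends, key=lambda b: virtual_free_gb(*b[1:]))

# One big cluster and two small ones, same ratio everywhere:
same_ratio = [("big", 10000, 4000, 1.2),
              ("small-1", 2000, 500, 1.2),
              ("small-2", 2000, 400, 1.2)]
# The big cluster always has the most headroom, so it takes
# every new volume, which is exactly what we were seeing.
assert pick_backend(same_ratio)[0] == "big"

# Lowering the big cluster's ratio lets the others win again:
tuned = [("big", 10000, 4000, 0.5),
         ("small-1", 2000, 500, 1.2),
         ("small-2", 2000, 400, 1.2)]
assert pick_backend(tuned)[0] == "small-2"
```

Once the clusters are evenly filled, the artificial ratios can be removed, which is the plan we described above.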
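And on the Windows IO alignment issue: a back-of-the-envelope sketch of why misaligned guest IO burns through an IOPS budget faster. The block and request sizes here are illustrative, not our exact settings.

```python
def blocks_touched(offset, length, block_size=4096):
    """How many backend blocks a single guest request spans.
    Each touched block counts towards the backend IOPS limit,
    so spanning extra blocks makes the limit bite earlier."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1

# A 4 KiB write on a 4 KiB boundary costs one backend op...
aligned = blocks_touched(offset=8192, length=4096)           # 1 op
# ...while the same write shifted by 512 bytes spans two
# blocks, so it costs two ops: the limit is hit twice as fast.
misaligned = blocks_touched(offset=8192 + 512, length=4096)  # 2 ops
```

This is why advertising the right physical and logical block sizes to the guest, or temporarily raising the IOPS burst limit, are the knobs we have been experimenting with.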
For that, we are building a new data centre in Prévessin which, when it gets fully deployed, will be three times the size of the data centre we have right now. In that data centre, what we are going to do is consider availability zones from the start and have a dedicated control plane for the OpenStack cloud there, completely independent. We want to use this setup as the baseline for disaster recovery and business continuity for the site. Things we are looking at include changing the layout of the hypervisors: we may go to a setup without local disks, boot-from-volume everywhere, just to simplify the hypervisors and keep the maintenance costs lower. As we want to offer business continuity and disaster recovery, we are looking into Cinder replication, also share replication, and multi-site setups for S3 on object storage, so that it is transparent to the end user where the data is located.

In conclusion, what we are doing is rebuilding the building blocks on which our users construct their applications, going more or less in the direction everyone else is going, taking all the feedback you gave us and implementing it in production. This is going to be exposed as one more region alongside the five we have, but completely independent, so it means we need to sync quotas, we need to create projects on both sides, we need to handle all this additional overhead to manage it.

I just wanted to thank you for attending this talk. If you want to know more about us, you can go to one of those links; in the tech blog we have more information about the cloud and what we are doing. All the code we have is open source, so you can take whatever you want from there: we have the local patches, we have scripts, we have a lot of stuff in there. And I would just like to thank all my team colleagues; without their hard work I wouldn't be here presenting this to you. Thank you. I think I have two minutes for questions. Sorry for that; please use the mics if you want.

Q: Thanks for sharing such a great experience. Just out of curiosity, what release of Ceph are you running right now?

A: We are on Octopus on all the clusters. We don't have Pacific; we are testing Pacific, but we are not there yet.

Q: Hello. You have been running the CERN cloud since the beginning. What kind of storage backend are you using for Nova and Glance?

A: For Glance, we are using Ceph, in a separate pool. For Nova, we are using the ephemeral storage available on the hypervisors, so that is local disk on the hypervisors; we are not using the Ceph cluster for ephemeral storage.

Q: So at the beginning you were already using Ceph for Glance?

A: Yes, that was our first usage of Ceph in the cloud.

Q: Okay, thank you.

Q: Hi, you mentioned using some services with different versions. Maybe I missed something; can you repeat or explain which services you are running that stay compatible with each other across versions?

A: Let me find the slide... OK, this is the one. So the thing is, if you think about it, all the services except a couple of them don't have tight bindings between each other; that is something we discovered very early on. Actually, the only ones we use that are really bound together are Neutron and Nova. The other ones can go further ahead in the releases. So, sorry, it's Neutron and Nova.
They are both on the Stein release right now, and this is the last hurdle we have to move off CentOS 7 and start deploying CentOS Stream 8. Once we have those services moved to Train, which is the compatible release between these two operating systems, then we can jump to Train, start moving hypervisors around, and start catching up on the remaining services. The remaining services are way ahead: we have everything on Wallaby except Glance, which has just been upgraded to Xena. Because they talk through the APIs, so Nova talks to the Glance API to download an image, it just works as long as the API is compatible. It's only these two, Neutron and Nova, doing RPC calls underneath, that need to be tied together.

Q: Thanks. What about Keystone?

A: Keystone is on Wallaby now. Everything is on Wallaby except Glance.

Q: The second thing you mentioned, that Windows performance on RBD is poor. Have you tried to maybe delete, or set to zero, the page file on Windows?

A: We did not do any tuning on the VM itself; we just run with the standard configuration that we use for all the Windows images, all the Windows desktops. It's actually the same configuration settings everywhere; we didn't go that far. The IO pattern is just different. What I can tell you is that if you look at the pattern of IOPS during boot time, it is extremely different from what happens later on. What we saw later on, with the physical and logical block sizes that we set up (it's not 4k, it's emulated 512), is that the requests are properly aligned; but at boot time we don't know what happens, we get much smaller IO rates and a kind of random behaviour. So we are trying to understand what triggers that, and we may end up just increasing the burst limit on IOPS operations for a short period of time, which would alleviate the behaviour they are suffering. So we still have some tunables, but I'm not sure that cleaning the page file would fix it.

Q: I don't know, I didn't test it, but maybe; it's a pretty specific case.

A: I mean, we only see it at boot time and during upgrades; apart from that it is normal. Thanks, we need to have a look at it.

Q: Hello, thank you for the talk. Why are you using the Xena version for Glance? Is there any specific feature you were looking for?

A: No, no, it's just that the idea is that we need to keep moving.
I mean, the thing is, when we started, our goal was to stay not too far from upstream, from the latest stable release, and we were able to catch up and keep our release cycle fairly close to it. The reason we are so far behind on Nova and Neutron is that we were heavily using nova-network, and we still have customers with machines on nova-network, and in the Train release it was removed from the code base. So we are a bit stuck, recreating boxes, running campaigns to recreate machines onto the new Neutron cells; once we are there, we can jump to the next release, and then we are going to move up faster and keep everything closer together.

One thing I can think of, for example, is Cinder: I cannot upgrade it beyond Wallaby. The reason is that they deprecated one of the API versions between Wallaby and Xena, the now-deprecated version V2, and Nova is still using V2 for attaching volumes in the version we are running. So I cannot do the upgrade until Nova starts using the V3 API. There are some extra considerations you need to make in this model; it's much, much simpler if you have the whole cloud at the same release level, but we need to play a bit, feature-wise.

Q: Thank you. Maybe one last question: how many people are you, in your team, to maintain the entire infrastructure, with the Ceph setup that you have?

A: So, the cloud team is actually somewhat reduced now: we are eight to ten people. But the Ceph clusters are managed by another team; there is another team of four or five people that just does Ceph. And the Ceph clusters we are using are also used by other teams in IT that have access to them, so it's not all in one place: it's split, the clusters are building blocks for other services, and that team does all the maintenance and upgrades.

Q: You mean the Ceph team?

A: Yeah. And we have different qualities of service, different setups, some with spinning drives, some with flash, some mixed; there is a whole variety. Behind our setup I think there are eight different Ceph clusters right now, with different configurations and so on.

Q: Okay, thank you.