So, good evening. I hope you've had a nice lunch. My name is Anastasis; it's a Greek name, a little bit unusual, so please bear with me on that. I work at CERN, and I'm going to do something a little strange today: I'm not going to talk much about CERN. We're going to talk about OpenStack most of the time, and I'm also going to tell you a little of my story, how things started for me with OpenStack and how I first ran into the problem of booting. I don't know if any of you have ever experienced problems when you try to boot a very big infrastructure and you have a lot of requests at the same time. Has anyone had problems scaling booting with Glance? Okay, I hope I can give you an answer today.

When I first came to CERN, they told me I would work with OpenStack and with this specific machine here, which is quite an amazing piece of machinery; it looks like something out of Star Wars, so I was very, very excited about it. When I first saw it, my first question was: how much data does this thing produce? And I got my answer pretty fast, because one of the remarkable things about these machines is that they produce a lot of data. And when I say a lot, I mean almost 40 terabytes per second, which is a tremendously big number, beyond the scale of any big company; I don't think even Google handles that amount of data.

So my question then was: what do we do here? How do we manage all of this? This is the pipeline of how we do it. At the very beginning we use electronics, FPGAs and the like, to filter most of our data and reduce the 40 terabytes per second to almost 300 gigabytes per second. In the middle, you see, we have a complex of clusters able to do even more filtering of the data, and then we propagate all the interesting information we found onto a big network of computers we call the Grid, which is a distributed alliance, let's say, of many data centers around the world. The Grid is where the analysis takes place. So we do the filtering on our site, and then we propagate the interesting data to be analyzed inside the Grid.

To be able to filter that amount of data we of course need a very strong complex, and this is its heart: we call it the HLT, the high-level trigger, which is a big and very powerful infrastructure. Just by the numbers, we have almost 1,300 nodes, almost 15,000 cores on those machines, a lot of RAM, and a lot of disks. For perspective, there is a nice benchmark we use, at least in the physics world, called HEP-SPEC06. I don't know if anyone has ever seen that benchmark. No? Okay, one or two people. Our cluster scores 199.95k on it, so almost 200k. And for perspective, Tier 0, which you see here, and all the Tier 1s are the data centers around the world that, combined together, help with the analysis. So on our site alone we have a very, very strong infrastructure to do our job. Then my question was: if we need so much CPU-intensive code to run this infrastructure, why do we need a cloud on top of it?
And I got a very nice email that had no notes, no words inside, not even a subject; it was just this image. And this image was telling me, if you look at how the lines go, the amount of data we extract depending on the day. There are gaps in between, where this infrastructure is not working at all. So the idea was: let's do something to fill those gaps, to do something useful in the periods when we are not aggregating data from the LHC. This is why we built a cloud around it, and this is why we needed OpenStack: we somehow had to serve this infrastructure to two completely different groups of people. On one side we have the people who aggregate and filter the data, and on the other side the people who analyze the data. Using a cloud and virtual machines was the best way to go: the two groups don't interfere with one another, they are isolated, and both can work very nicely.

So we set up OpenStack, Grizzly back then, two years ago, and we could only deploy the basic, strictly needed services: Nova, Glance, Keystone, and the Nova APIs. The reason is that we didn't want to interfere too much with the infrastructure, because for those key services we had to dedicate machines; we had to extract them from their main purpose of running the data filtering, and they would be dedicated only to the cloud. So we were not allowed to use that many dedicated machines, and we also wanted to be sure everything works properly; we use Corosync for that, and all the other services, like RabbitMQ and MariaDB, run in a failover mode. But the reason we run all these services in multiple instances is not only to be sure everything works and is redundant; it's also about scaling. When you have a lot of requests and you want to massively boot your whole infrastructure in as little time as possible, you somehow need to handle all those requests, because, as I said, this is a cloud that runs for only some hours at a time, and you cannot afford to spend time starting your virtual machines, or on anything that is not about computing. Within one minute you will get a huge number of requests to boot everything, so you need to handle that: the schedulers, the APIs, everything must run in multiple instances so all these requests can be served correctly.

So we ended up in a situation like this one. First you can see the four controllers: these are our special machines where our core services live. In the middle we have the Glance service with two instances, and MariaDB, which is a small dedicated cluster of three machines. All the other remaining machines are just Nova hypervisors, nothing else.
They just host the virtual machines. Also, about the networking: all of our machines have only a single 1 Gbps link, because what we care about most is the computation, not so much the network traffic. So they have regular Ethernet with a 1 Gbps link, and they are connected to two networks: on one side the network that all the OpenStack traffic passes through, and on the other side the network through which the virtual machines are connected to the outside world, to communicate with the internet or for whatever other needs there are.

So when we booted and did our first test, we ended up in a case like this one. On the left side you see the number of virtual machines that get booted every minute, and on the x-axis are the minutes. When we asked for almost 400 machines to boot instantly, we had to wait almost two hours for it to finish. It wasn't a very good demo, and this is where we actually started thinking: what's the problem, what's going on behind the scenes, and how can we solve it? Of course the problem is the massive booting, nothing else, and the major issue is how you propagate your images to the virtual machines.

So we made a small list of what our targets should be, to solve the problems one by one. There is one trick in this whole story, which is the Nova cache: when you ask for an image, Nova stores that image so it can be reused later. But our problem was that this image changes frequently, so on the one hand we were very happy to discover the cache, and on the other hand it was not very helpful. Networking was, again, a bottleneck for the same reason: maybe one or two runs would be okay within one, two, three days, because we could reuse the same image, but after that you have to change it again, and maybe you lose your whole day just waiting for your images to propagate. Also, as I said, we are not really allowed to play with a lot of different tools to do what we want.

So the first thing we did was to review a little what booting is and how OpenStack works. This is a very simplistic picture, of course; I'm sorry for my graphics, I'm not, you know, very good at that. On the very left side we have the request: a regular user asking for a virtual machine. The request goes through the API, the standard way. Then you have your scheduler, where some nice magic takes place: your hypervisor is picked correctly with filters, weights, and so on. When a Nova compute node has been picked and you know where your hypervisor is, Nova communicates with Glance, asks for the image to be served, and then you have your virtual machine. That was the whole idea, from my perspective.
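Seen from the client side, that whole pipeline hides behind a single call. Here is a minimal sketch using python-novaclient, roughly as it looked around that era; every credential, name, and URL below is a placeholder, not our actual setup:

```python
# Minimal client-side boot request; assumes python-novaclient from roughly
# the Grizzly era. Credentials, names and the endpoint are placeholders.
from novaclient import client

nova = client.Client(
    "2",                               # compute API version
    "myuser",                          # username (placeholder)
    "mypassword",                      # password (placeholder)
    "myproject",                       # tenant/project (placeholder)
    "http://controller:5000/v2.0",     # Keystone auth URL (placeholder)
)

image = nova.images.find(name="hlt-worker")   # the image Glance will serve
flavor = nova.flavors.find(name="m1.large")

# This one call exercises the whole pipeline sketched above:
# nova-api -> scheduler (filters pick a hypervisor) -> nova-compute -> Glance.
server = nova.servers.create(name="worker-001", image=image, flavor=flavor)
print(server.id)
```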
This is a very nice diagram; I think it is the correct way to do it when you have distributed systems. I don't know if anyone disagrees, but I think it's at least the obvious way to do it. The question is what happens when you have a lot of requests, a lot of users, and a lot of machines that must be booted at the same time. The situation becomes something like this, and I changed the colors a little to make it more obvious. The API, because it plays quite a simple role (well, not trivial, but still easier than the scheduler's), will be loaded, but not that much. Then you have your scheduler, which plays a bigger role: it has to talk to the database and run some queries underneath to pick the right hypervisors, so it's going to be much more loaded; this is why we run it in four instances in our case. Then you have your hypervisors, but hypervisors are distributed by definition, so no big change there: whatever each one has to do for one request, it does the same for 1,000 requests. On the other side you have Glance, which is going to suffer under this whole workload, because it has to serve however many requests you have, simultaneously. So this doesn't really work if you have just one instance of Glance; that is not a distributed system, it's just, you know, the simple approach.

I think you will agree with me that booting takes time, and this is why our diagram was almost linear: we had to serve the images serially, one by one. In our case, running the CERN infrastructure, this is a tremendously big problem; we had to solve it. And I think you will agree it's not only those of us with a CERN-style infrastructure: even big companies that serve virtual machines to their users don't want to spend time waiting for a virtual machine to boot, because time spent is money, and so on.

Of course there are solutions to this problem. The obvious one is a distributed file system; I think that's how most people deal with it. Another is an object storage like Swift; that was the recommended solution, at least some time back. And, well, the obvious and simple solution, for students like me at least, is: let's have a lot of Glances everywhere and boot from there. But it's not that simple; it takes a fair amount of complexity to actually handle all of that. Okay, maybe a lot of Glance instances is not that big a deal, but working with Ceph or another distributed file system or object storage needs quite an amount of knowledge and sysadmin skills to organize. And if you have a lot of Glances, you also need to invent your own way to handle and sync them all together, because if you add an image on one, you somehow need to propagate that change to the others. So you need to do something.

And again, just to be sure we are on the same page: we have a massive infrastructure and a short time in which to boot everything, so we can give CPU power to our users; but on the other hand we have very limited permissions and very little manpower to do all this.
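Just to quantify why one Glance serving images serially is so painful, here is a quick back-of-the-envelope calculation; the numbers are the rough figures from this talk (a ~1.3 GB image, a single 1 Gbps link on the Glance node, 1,200 machines), so purely illustrative:

```python
# Back-of-the-envelope: how long does one Glance take to serve every
# hypervisor serially over a single 1 Gbps link? Illustrative numbers only.
image_size_gbit = 1.3 * 8          # ~1.3 GB image expressed in gigabits
link_gbps = 1.0                    # one 1 Gbps NIC on the Glance node
n_machines = 1200                  # size of the massive boot

seconds = n_machines * image_size_gbit / link_gbps
print(f"{seconds / 3600:.1f} hours")   # ~3.5 hours: close to the ~4 h we saw
```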
So the question arises: how can we improve this? How can we exploit the Nova cache to our advantage? It's a nice feature. How can we maybe reduce the image size somehow? And, of course, the biggest question of all: how can we distribute the images in a better way, without too much administration overhead?

Our answers, from top to bottom. For the Nova cache, we said that if we know the image we want to serve, we can just push it there before the request even comes through. This is a bit like what big organizations do with the notion they call golden images; I don't know if anyone has heard of that. When you have an image that is very, very common and you know a lot of your users are going to use it, you just drop it in the cache by default, so when a request comes, none of the rest is needed: your machine just boots up instantly.

For the image size, we had a problem there, because we are never really sure what kind of image we will get from our users; the users are completely different organizations, out of our scope, and we are just serving them. So we asked: can we compress the image? I'm going to give you some really nice numbers about the compression later on.

The last thing is how we distribute the image, and we said the simple thing to do is maybe to use some regular HTTP proxies: Squid. I don't know if you've heard of Squid; it's a somewhat old-fashioned way of doing HTTP things, at least these days. We had some proxies around, so maybe we could exploit them for our need to be faster on the distribution.

So our first test was something like this. We set up an HTTP server, a regular Apache server, on the Glance side. Whenever a user adds a new image, we take it on their behalf and add it to this Apache server; of course, we compress it. Then we set up four Squid servers on those special machines we have, maybe even more, because as I said we have some extra machines dedicated to our needs. And then the only thing we need to do is go to the machines (we already know the image we want to precache) and just kick off the wget, and magically, yeah, it works.

Here is a little more about the workflow, how it actually works. Your user adds the new image to Glance as usual; nothing changes, and the users never notice what's going on in this case. Behind that we have a cron job that just checks the file system, very simple stuff. We take the image, compress it with gzip, regular stuff again, save it in the served folder of the Apache server, and the Apache server just magically sees it.
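As a sketch, that Glance-side cron job could look roughly like this; the paths and the "publish once" logic are assumptions for illustration, not our exact script:

```python
# Sketch of the Glance-side cron job: watch the Glance filesystem store
# for new images and publish gzipped copies under the Apache docroot.
# All paths here are assumptions for illustration.
import gzip
import os
import shutil

GLANCE_STORE = "/var/lib/glance/images"     # Glance filesystem backend
APACHE_DOCROOT = "/var/www/html/images"     # folder served by the Apache box

def publish_new_images():
    for image_id in os.listdir(GLANCE_STORE):
        src = os.path.join(GLANCE_STORE, image_id)
        dst = os.path.join(APACHE_DOCROOT, image_id + ".gz")
        if not os.path.isfile(src) or os.path.exists(dst):
            continue                        # not an image, or already published
        # Stream-compress so a multi-GB image never sits in memory.
        with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)

if __name__ == "__main__":
    publish_new_images()                    # run from cron, e.g. every minute
```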
On the other side, on the Nova compute nodes, we kick off the wget, and the way we do that in our case is MCollective; MCollective is a distributed tool for propagating commands, so we just trigger everything with one command: please fetch my image everywhere. The wget then runs, we get the image, and we uncompress it on the fly; I'll go a little more into that. We store it in a temporary folder, because Nova may react a little strangely otherwise, and when the image is ready we just copy it into the Nova cache; there is some black magic in Nova about SHA-1 naming, and then everything works as it should.

One interesting thing is the compression itself, and what we used was just a regular gzip command: take the image, compress it. It's really interesting; just look at the first line, the size. We had an image of 1.3 gigabytes, one of our examples, not the best one but not the worst one, and just by compressing with gzip it goes from 1.3 gigabytes to almost 400 megabytes, cutting the size by more than a factor of three. It's a big gain. What is also interesting is that we tried some other ways to compress it, and anyone who wants to do the same should always think about the ratio between zip time and unzip time. You shouldn't care that much how long the zip takes, because it is paid only once; the unzip time, on the other hand, is paid by every Nova compute node. So you must be sure your unzip is really fast if your networking is the problem; if, on the other hand, you have very fast networking and you're fine with the propagation, maybe you can go for even stronger compression and win a little more on the size. It depends; you have to find your own use case.

With the setup from that example we tried the boot again, and can someone guess what the result was? It was instant. All right, of course I'm cheating a little here, because I'm not telling you how much time we needed to propagate the image. But I'm not actually cheating, because the propagation took only some minutes, and then we got the request and everything booted instantly, because everything started from the cache.

We also tried to measure the time we need for a bigger installation. This is an example with almost 1,000 machines, more or less, and this is the time needed only to propagate the image, not to boot, because booting is instant and we don't really care; it's not included in this diagram. It's just the time from giving the command to MCollective to having the image on the Nova nodes. At the very beginning you see (sorry, this is in seconds, not in minutes) that for some seconds nothing happens, because the images are being propagated from Glance itself to the Squid servers. You need some time for all your Squid servers to have the image cached, and once they have it, they start handing it on to the servers.

One more trick we did: not everything here is ideal, and Squid is not actually made for things like this; it needs some proper configuration to be able to do it. So when we actually run the wget from our hypervisors, we add a small random delay, so that not all of the hypervisors start asking for the same image simultaneously. This delay, if I remember correctly, is between zero and 20 seconds, picked at random.
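Putting those pieces together, here is a sketch of the compute-node side, i.e. roughly what the single MCollective command triggers on every hypervisor. The host names, paths, and the SHA-1 cache naming are assumptions for illustration; check them against your own Nova version before reusing any of this:

```python
# Sketch of the compute-node precache step: random delay, fetch through a
# Squid proxy, gunzip on the fly, then move into the Nova image cache.
import hashlib
import os
import random
import shutil
import time
import urllib.request
import zlib

IMAGE_ID = "ad3f3a40-0000-0000-0000-000000000000"   # Glance UUID (placeholder)
URL = f"http://glance-apache/images/{IMAGE_ID}.gz"  # served by the Apache box
PROXY = {"http": "http://squid-1:3128"}             # one of the Squid servers
TMP = "/var/tmp/precache"                           # staging area, hidden from Nova
CACHE = "/var/lib/nova/instances/_base"             # Nova's libvirt image cache

time.sleep(random.uniform(0, 20))     # the 0-20 s random delay from the talk

os.makedirs(TMP, exist_ok=True)
tmp_path = os.path.join(TMP, IMAGE_ID)

# Fetch through the proxy and decompress chunk by chunk, so a multi-gigabyte
# image never sits in memory.
opener = urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))
gunzip = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)   # gzip framing
with opener.open(URL) as resp, open(tmp_path, "wb") as out:
    while chunk := resp.read(1 << 20):
        out.write(gunzip.decompress(chunk))
    out.write(gunzip.flush())

# Older libvirt drivers named cached base images after the SHA-1 of the image
# ID; that is the "black magic" naming mentioned above (an assumption here).
cache_name = hashlib.sha1(IMAGE_ID.encode()).hexdigest()
shutil.move(tmp_path, os.path.join(CACHE, cache_name))
```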
So, the last numbers we gathered for our whole infrastructure. If you go the normal way, with just one Glance serving your infrastructure, and you only have sparse requests, maybe it works really nicely; but if you want to do a massive boot, no. In the first case, with no Squid and without compression, we are at almost four hours for one thousand two hundred machines. If we go with compression only, just by compressing the image we go from almost four hours to one hour, which is a big difference from adding a single feature. On the other side, if we have no compression and go only with Squid, we are at half an hour; so taking care of your network, doing the networking correctly so you can propagate your image, is the big deal here. And of course, if you add both tricks, you get to less than two minutes, as the title at least suggests. That shocked us a lot: how you can do simple stuff with simple tools, because Squid, an Apache server, and one command in MCollective gained us a lot of time in serving our users.

So we came up with some suggestions for things we would like to see from the community. The first is that maybe Nova should consider supporting arbitrary proxies. It already relies on regular HTTP requests to get your images from Glance, and it would be nice to have just one parameter in the config file: please fetch through this proxy. I think it would be quite easy to do, and it would serve some purposes, at least in our case. Also, the nice thing about proxies is that you reuse a technology known to a lot of administrators, at least the older ones, and you can play a lot with the topologies: you can have multiple levels of Squids to do your magic. The good thing is that your whole propagation then goes through a tree, so instead of n serial transfers you actually need about log(n) propagation steps, which is a big speed-up.
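On that tree point, a tiny sketch of the arithmetic, with purely illustrative numbers:

```python
# Why a proxy tree helps: with fan-out f, each level of proxies can feed the
# next, so reaching n consumers takes about log_f(n) propagation "hops"
# instead of n serial transfers from a single Glance. Illustrative only.
import math

n_hypervisors = 1200
fan_out = 4                  # e.g. four Squids per level, as in our setup

serial_steps = n_hypervisors                          # one Glance, one by one
tree_steps = math.ceil(math.log(n_hypervisors, fan_out))

print(serial_steps, "serial transfers vs about", tree_steps, "tree levels")
# -> 1200 serial transfers vs about 6 tree levels
```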
On the other hand, we would like to see compression. Even if the user doesn't do it, maybe the administrator could configure Glance to store the images compressed, because not all users use appropriate formats for things like that; maybe you have a user who is just storing a raw image. Why should you be bothered to store all of that? Maybe you can just compress it and take some megabytes back. Again, there is a workflow on the slide about how we were thinking to do that. By the way, do we have people here who are committers on the Nova side? No? Okay; this idea about image compression was suggested during the Google Summer of Code last summer and, well, it was not accepted.

Another idea that came up, which is not exactly our case but is still interesting to explore, would be for Nova to be aware of its cache. When I say aware, what we were thinking is this: a user asks for a VM, the image gets cached on one hypervisor, and then maybe the VM is destroyed. When you ask again for the same image, why not make it probable that the request goes to the same hypervisor, the one where the image is already cached, and do things much, much faster? We were thinking of something like this: Nova reports back to the database, and, I don't know if you are aware of the regular OpenStack scheduler filters, they pretty much do the same thing; they have all the information about your hypervisors and filter on some attributes. We just want to add one more attribute, about which images each host has cached, so that when the request comes, the scheduler is able to understand what's going on.
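As a sketch of what we mean, here is a toy, standalone version of such a cache-aware filter. In real Nova this would subclass the scheduler's BaseHostFilter; the cached_images attribute is our proposed addition, not something stock Nova reports:

```python
# Hypothetical cache-aware scheduler filter, written standalone so it runs
# anywhere. The `cached_images` attribute on each host is the new piece of
# reported state we proposed; it does not exist in stock Nova.
from dataclasses import dataclass, field

@dataclass
class HostState:
    name: str
    cached_images: set = field(default_factory=set)  # image UUIDs in Nova cache

def hosts_with_cached_image(hosts, image_id):
    """Prefer hypervisors that already hold the requested image."""
    warm = [h for h in hosts if image_id in h.cached_images]
    return warm or hosts        # fall back to all hosts if nobody has it

hosts = [
    HostState("hv-01", {"img-aaa"}),
    HostState("hv-02", {"img-aaa", "img-bbb"}),
    HostState("hv-03"),
]
print([h.name for h in hosts_with_cached_image(hosts, "img-bbb")])  # ['hv-02']
```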
So, for the conclusion to all of this: we are very happy that we actually made it work, and I think the physicists on the other side are even happier. If you were here on the first day and saw the presentation about CERN, you know the physicists are quite demanding people. From the four hours our first dummy example needed to do the work, right now we need just 10 minutes. And I think the biggest point here is that we did all this dark magic, all this hacking, with solid technologies; it's not a pile of custom-made scripts that have to be rewritten from scratch. We used solid technologies, and I think that was the best part. And we think there is even more room for improvement, and that corner cases like ours can easily be accommodated by the community. I know not all of you have the same problem as we do; I think nobody else needs their whole infrastructure booted at once. But even slightly exotic needs like ours are not very hard to fix and to satisfy. So, any questions?

[Audience question, inaudible]

Yeah, that was one other thing I wanted to add to my presentation, but my supervisor didn't want me to. So yeah, we're thinking about doing this with torrents, and it's actually a remarkable idea in my opinion. The problem is that it's really hard to do. You need to think about a lot of things, about the topology of your infrastructure, because everything has to pass through some routers, so you need to know where your hosts are; you need to know a lot of stuff. So it's a little hard to figure out how you'd do it, and the hardest part is how you give it to the users so they can manage it. But I totally agree that with a bit of work this could be done, and it would be really, really nice.

[Audience question, inaudible]

Yeah, so, when we actually booted all this, it was some versions ago, and in Grizzly none of this was supported. And I don't think what the very new versions do is actually pre-caching; I think it's about being able to cache the image on the Glance side, so you still have to redistribute it somehow and your bottleneck stays on Glance; you still have the same problem. Also, in the new Glance, if I remember correctly, they started working on a middleware to take care of all these things (I'm not really an expert on the Glance side), but it still doesn't support our version, so we had to hack our way around it. The good thing about this, you know, dirty hack, let's say, is that it works with any version, because it's completely external to OpenStack.

Sorry, I didn't hear you. Yeah... yeah, I agree. This was just, you know, a simple illustration of how much you can win just by doing a very simple compression. You could even use non-streaming algorithms; you can go to more sophisticated compression in the end and have even bigger gains, or whatever you need. It depends on your case. It was okay for us; we said, okay, just gzip it.

Other questions?

[Audience question, inaudible]

No, no, just... no. Squid was working; it was fine.

[Audience question, inaudible]

No, no, most of the time it works manually. When I say manually: when the user inserts the image, you know more or less when you want to start the propagation, so you give a small time window between those steps, and then you say: okay, now this is cached, go on, take the image.

Okay. So, thank you very much.