So, hello everyone. We are here to talk about containers on bare metal and preemptible instances. We are from CERN and the SKA. My name is Belmiro, and I work at CERN in the cloud team.

Hi everyone, my name is John Garbutt and I'm currently working at StackHPC as a principal engineer. I should be clear that I'm talking on behalf of myself; I work as a subcontractor for the SKA, but I'm talking about some interesting things that are happening. To give a little background on myself: I started working on OpenStack in December 2010, around that time frame, and I've done lots of interesting things since; I've been very lucky with the people I've managed to work with. I started on the Nova project, I was the Nova PTL in Liberty and Mitaka, and I've been on the TC for a little while. Currently I'm focusing on the HPC kind of world.

Before we get started on containers, bare metal and preemptible instances, I wanted to start by explaining why we've got people from the SKA and CERN together on the stage. Quite recently there's been a collaboration between CERN and the SKA. We kicked this off, got together and discussed where we have common requirements on OpenStack. An interesting thing is that today CERN has some very big infrastructure, but in the mid-2020s time frame it's going to need roughly 50 to 100 times more capacity for what's happening: the luminosity increases by a factor of 10, which is a really bad scaling factor, and you end up needing an awful lot more compute. The observation was made that the SKA right now is in a prototyping phase, looking at what's needed and working with the physicists on what's going to happen, and over time, when the SKA goes into production in the mid-2020s, it's going to face a similar order of magnitude of problem to the one CERN is looking at. So we got together and looked at what's in common.

So, first a quick introduction to what CERN is up to. Let's go through this slide first; it's one of my favourite slides, the universe in one slide. We are trying to represent 13.7 billion years in this slide. What CERN and the SKA do is try to understand the universe, these very big questions. At CERN we try to understand the milliseconds after the Big Bang; that's why we have all those accelerators and detectors, to try to recreate the conditions of matter in those early milliseconds. The SKA looks into all the rest, trying to understand the mysteries of the universe after the Big Bang. CERN, the European Organization for Nuclear Research, is one of the biggest research organizations; it was created in the 1950s and its headquarters sits on the border between France and Switzerland.
This is a high-level view of the CERN site. To study matter, CERN has a network of complex accelerators. You can see some of them here, but the biggest one is the Large Hadron Collider, a ring 27 kilometres around, in a tunnel about 100 metres underground, that actually crosses the two countries, France and Switzerland. In the image you can see the Lake of Geneva and the Alps at the back.

What happens is that this large particle accelerator, the LHC, accelerates two beams of particles in opposite directions and they collide. When they collide, we want to capture that moment, and for that we have these huge detectors, sitting in huge caverns, also about 100 metres underground. They are basically digital cameras, although not normal digital cameras: they take 40 million pictures per second. This produces a lot of data, around one petabyte per second. Of course, we cannot handle all that data, so it needs to be filtered, and after all the filtering what we save are the interesting events, what we think is new physics, which is worth analysing. That is a few gigabytes per second that we store in the CERN data centres.

To analyse all that data, it is distributed around the world to different research centres, but at CERN we also have our own clouds and a lot of the data is analysed there. This is one of our monitoring dashboards, where you can see the size of our clouds: more than 300k cores, more than 9k hypervisors, more than 4,000 projects, and a lot of VMs. Most of these VMs, more than 80% of our capacity, basically run the jobs that process the LHC data; the rest is other projects and also IT services.

So we've been talking a lot about the SKA, and I probably forgot to mention that it actually means Square Kilometre Array. We've been talking about CERN and everything else, so what is this SKA? Here we've got a picture of the radio telescopes. The idea is that there's a large array of radio telescopes all working together to look further and further back in time, basically by having a higher-resolution picture, so you can keep looking deeper. There are actually two telescopes in some sense: one site in South Africa and another site in Australia. They're both out in the desert, away from all the radio interference; you don't want to be there with a mobile phone ringing your friends, because you might pretend to be some kind of pulsar, and that would be bad. So you need to get away from all the radio interference to have a look. The other type of antenna basically looks a bit like some coat hangers that have been tied together; they're about my height, and they're all wired together. One telescope is looking at sort of mid-range frequencies and the other is looking at low frequencies.
So there are two different systems, but they're both connected to a very similar-looking supercomputer to process the information, and that's what we'll eventually start talking about. Just to describe the flow of what's happening here: there are lots of signals coming out of the radio telescopes; they go into a digital signal processor, which turns them into UDP packets. It's doing things like working out, for this antenna, what the difference is from all the other antennas, and it sends those UDP packets over the wire across to the supercomputer, and then we have to do something useful with them. The scale here initially seems quite large and then gets kind of incredible. Roughly, when you look at the pipe coming into the supercomputer on the right-hand side (that's the part I've been worrying about most recently, because I'm closest to it), you've got about 500 gigabytes a second coming in as UDP, and you don't really want to drop any of those packets, and you need to do something useful with most of them. That's an interesting challenge.

So let's move on to how we're using OpenStack here. For this particular talk, I'm looking at how we're using containers and bare metal for the SKA, and it's a combination: you've got Nova deployed with Ironic as the driver behind Nova, you're creating the containers with Magnum, and we're seeing how that all fits together. Before I go into that in detail, I just need to give a little more context about what's actually happening inside that supercomputer. What we've got here is a whole load of compute nodes, and the UDP packets are coming in at the bottom. The basic flow is that a whole load of image data comes in; we need to write that to storage in a format that's useful, so that's an ingest process; then, at a time that's convenient, we read it back out and process it to get the results, and those results then need to get shipped off. That's the rough flow through the system.

So how do we make this real? It all sounds very fuzzy. At Cambridge University we've been building a prototype for this supercomputer called ALaSKA. It's currently at two racks of hardware, looking a little bit like this, and we've got InfiniBand as the high-speed interconnect on one side and 25 Gigabit Ethernet where the UDP packets are going to come in. We're basically trying to work out how we can orchestrate this as efficiently as possible, because at this scale, if we start losing one or two percent of performance, that's an awful lot of extra compute capacity that would be required. So it's really important to try not to waste that and to put things together as efficiently as we can.

So, why bare metal, why containers? For this particular talk I just wanted to go through some properties of the system to make clear why we're making these choices. Firstly, this is a special-purpose kind of system, in the sense that there's a single security zone. We've got the telescope and all the packets coming in, and right now this big supercomputer near the telescope is a really precious resource, so we're being very careful with the workflows and optimizing what's actually happening inside it.
So there's only the one security zone, and because of that, and because we're really focusing on performance over security, we've chosen to go bare metal.

A really interesting requirement for this telescope, which is probably pushing a lot of the boundaries of what we can do, is that at times there are some very interesting things that happen in the sky where you just want to go and look at them. Say there's a supernova and you want to go look at it: you find that out, you need to drop what you're doing and go look at the big shiny thing over in the other corner. That sounds great, but there are some even more extreme events. When you look at these deep-space, very big signals, only about five of them have been seen over a decade or so (I can't remember quite how many). They happen so far away that the Doppler shift means we might be able to detect them in one frequency somewhere, and the globe will turn around just enough for them to become visible on the other frequency because of that Doppler shift. Someone was talking about that being a potential possibility; I'm not sure 30 seconds will be quick enough to capture some of that, but that's the kind of thing people are thinking about: how do we see that thing in the sky?

That pushes us to think about how quickly we can deploy Nova instances, and if we're using Ironic, anyone that's watched a BIOS boot recently will know that you can go get a cup of coffee in the time the thing takes to turn on. That would be pretty bad, because the nice shiny thing in the sky wouldn't be there anymore. I mean, you might have coffee, which is good, but it wouldn't be much use. So this has to be in the back of our minds; actually, not the back of our minds, it's a really important requirement.

The other piece here is that we have to think about how long this system will be in use and what kind of workflows will be involved. We really can't say we will always run all the workloads on Spark, or we'll always use Dask; those are not options. We need to allow for that kind of choice and flexibility, really make use of modern development paradigms, and make sure that we're not tying ourselves down in that sense.

So, Magnum with Ironic. When we were looking into this, as we started this SKA collaboration, we sat down and asked, you know, this Magnum thing, does it work? There was lots of nodding from the other side of the table, so we moved on and tried it. With Ironic it initially wasn't so much fun, but most recently, in Queens, there's been a big change, in that the VM and bare metal cases are both using exactly the same code path. We're now using Fedora Atomic as the base image for both, so we don't have these awkward conversations saying "you've done some really good work on the VM side there, why is the bare metal one broken?", because it's the same code. We just don't get that kind of second-class-citizen approach anymore.
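To make that a little more concrete, here is a minimal sketch of what creating a bare metal Magnum cluster can look like from Python, assuming python-magnumclient, a Fedora Atomic image and an Ironic flavor. The Keystone URL, credentials, template, image, flavor, network and keypair names, and the swarm-mode choice are all placeholders for illustration, not the actual ALaSKA configuration.

```python
# Minimal sketch, assuming python-magnumclient, a Fedora Atomic image and an
# Ironic flavor already exist. Every name and URL below is a placeholder.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from magnumclient.client import Client as MagnumClient

auth = v3.Password(auth_url='https://keystone.example.org:5000/v3',
                   username='demo', password='secret',
                   project_name='ska-prototype',
                   user_domain_name='Default', project_domain_name='Default')
magnum = MagnumClient('1', session=session.Session(auth=auth))

# The template points at a Fedora Atomic image and a bare metal (Ironic)
# flavor, so VM and bare metal clusters share the same image and code path.
template = magnum.cluster_templates.create(
    name='swarm-baremetal',
    coe='swarm-mode',
    image_id='fedora-atomic-27',
    flavor_id='baremetal-compute',          # Ironic flavor
    master_flavor_id='baremetal-compute',
    external_network_id='cluster-net',
    keypair_id='ops-key')

cluster = magnum.clusters.create(name='alaska-swarm',
                                 cluster_template_id=template.uuid,
                                 master_count=1,
                                 node_count=8)
```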
So that's been a big step forward. Moving to a slightly different view of what's happening inside the current prototype, for the software that's running on top of the supercomputer, I just wanted to highlight a few of the things we're having to connect in to the physical machine that has the containers running on it. If you remember my original diagrams, I was talking about high-speed interconnects and 25 gig Ethernet networks to get the UDP packets in. The really important thing this diagram is trying to say is that there's more than one network, and right now Magnum doesn't actually support more than one network. So what we've done here is create the cluster with Magnum (or rather, we have some Ansible scripts, and the Ansible scripts create the cluster), and on top of that they have to go in, add additional ports to the hosts and then configure those appropriately. It's not perfect, and we're going to have to work with Magnum to sort some of these things out, but it is possible to have all three of these networks attached to your Magnum cluster. So yes, there are problems with the ports, but you can work around them.

There are a few little niggles we found along the way. We're using Fedora Atomic version 27, and the version of Docker in there is not as new as people wanted, so we had to upgrade it. That's when I discovered that RPMs and yum aren't a thing on Atomic, which I probably should have known, but I didn't. (I didn't actually do the work, someone else did, but I found out all these interesting things about Fedora Atomic.) Part of the follow-on from that is that we wanted the high-speed storage I mentioned: we have InfiniBand, and we thought RDMA was a great idea to get the best out of the InfiniBand, and that meant updating Atomic with drivers, and again we hit similar issues. With some automation to get this sorted, it's not impossible, it's just a bit of a bump. Somehow we managed to break cloud-init in the process; cloud-init is usually pretty robust once it's working, but we made it unhappy. Luckily cloud-init is extensible, and again we were able to work around that for now. It was something to do with LVM and having to grow the partition in a way the root filesystem didn't fancy. Anyway, with a little bit of help that happened, and there are loads of interesting details, but if you want to read more, have a look at the StackHPC blog. Generally, just have a look at the StackHPC blog; when we do interesting things we try to share them with everyone, so it's a good way to keep track of some of the fun stuff that's happening.
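For what it's worth, the "add extra ports afterwards" step that our Ansible performs can be approximated roughly like this with openstacksdk. The cloud, network and node names are made up, and the interface configuration inside the host OS still has to happen separately.

```python
# Rough sketch of what the Ansible post-configuration does: attach an extra
# network (e.g. the 25GbE ingest network) to each bare metal cluster node.
# Cloud, network and node names are placeholders.
import openstack

conn = openstack.connect(cloud='alaska')                     # placeholder cloud name
ingest_net = conn.network.find_network('p3-ingest-25g')      # placeholder network

for server in conn.compute.servers(details=True):
    if not server.name.startswith('alaska-swarm-'):          # our Magnum-created nodes
        continue
    port = conn.network.create_port(network_id=ingest_net.id,
                                    name='{}-ingest'.format(server.name))
    # Hot-plug the port into the instance; configuring the interface inside
    # the host OS is still done separately (by Ansible, in our case).
    conn.compute.create_server_interface(server, port_id=port.id)
```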
So, I've now been talking for almost all of the talk; I think it's a good time to hand over and talk about preemptible instances.

Something that we at CERN, and also the SKA, are looking for is preemptible instances. Let's first try to understand the issue: it has to do with resource utilization. If we go to a public cloud, we have this illusion of infinite capacity, because if users need more capacity, they have a credit card and just pay the cloud provider for more. However, for private clouds, scientific clouds like at CERN and the SKA, this is not entirely true: users don't pay for the resources that they consume. So what we do is basically enforce quotas for each project; this prevents a user from exhausting all the resources in the cloud. We usually don't overcommit resources, or even quotas. Quotas also allow us to manage individual projects' requirements, in terms of capacity, in terms of the CPUs and RAM they use, without actually dedicating a specific set of hardware to each specific project. And we all have services or projects that are more important than others, so we always reserve more capacity, even over-provision, for those projects, because they are important. In scientific clouds, projects also have different funding models: the project receives the funding, the hardware is bought for that particular project, and the project expects that hardware to be there, to be used, when they need it. So we have all these issues.

Let me try to illustrate, and exaggerate a little, what the issue with quotas is, just to stress the problem. If we have a cloud with only two projects, both with quota, and the quota in one of the projects is exhausted, it's easy for the cloud admins, the managers, to go to the other project and allocate more quota to the first one, and it can keep going if it really requires those resources. That's easy with two projects; with a few more, maybe it's still doable; with a lot of them, it's almost impossible. You allocate the quota to a project initially, and it's very hard then to allocate more or less during the lifetime of that project, because that implies changing the quota on all the other projects, since you don't want to over-provision your quota. So in the end we can have a lot of resources that were bought for processing jobs, for science, sitting unused, because it's very difficult for a project to always have workloads ready to feed those resources and process data.

Actually, this is not only a problem in private clouds; public clouds have exactly the same problem. They buy a lot of hardware and try to maximize the utilization of those resources. The way they do it is by having different pricing, SLAs and policies. If they have a lot of resources available, maybe they lower the price of the resources they provide, offering them with a different SLA, and they also have a spot market: you can request resources and pay less for them, and if someone else requires those resources and pays more, your instances will be shut down. This is the spot market that AWS and Google Cloud have.
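To make the quota reshuffling just described concrete, this is roughly the manual operation a cloud admin ends up doing today; a small openstacksdk sketch with made-up project names and numbers.

```python
# Sketch of the manual quota reshuffling described above, using openstacksdk.
# Project names and numbers are invented for illustration.
import openstack

conn = openstack.connect(cloud='example')      # placeholder cloud name

# Shrink the quota of a project that is not using its allocation...
conn.set_compute_quotas('project-a', cores=2000, instances=500)
# ...and grow the project that has exhausted its quota.
conn.set_compute_quotas('project-b', cores=6000, instances=1500)
```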
In OpenStack private clouds we don't have this concept; quotas are hard limits. So what can we do? That's why we started discussing with the Nova team, and between the SKA and CERN, how we can improve this, and we had the idea to basically do what public clouds are already doing with the spot market, but without the market, because our users don't pay. These are the preemptible instances. The idea is that when a project has exhausted its quota, it will still be able to create new instances, but with a lower SLA: if another project requires those resources, has quota, and cannot create instances because the cloud is exhausted, the spot instances, the preemptible instances, will be deleted.

A few months ago we started to implement a prototype of this. The idea is to start very simple, and to talk with the Nova team to see what is actually required in Nova to ease this task. We have a few specs upstream that are being reviewed back and forth, and it's going well. Meanwhile, we are also developing our prototype, and the idea starts simple. What we are trying to do now, and will probably deploy soon in the CERN cloud, is to have dedicated projects for the preemptible instances. These preemptible projects will have unlimited quota, and the users with access to those projects will create the preemptible instances; when the resources are required elsewhere, these preemptible instances will be deleted.

Let me try to explain the workflow. We are trying to create an instance; the Nova scheduler tries to allocate resources, and there are no resources available, so, as you know, today the instance goes to NoValidHost. One thing we are trying to introduce in Nova is a new instance state called PENDING: your instance goes to the PENDING state, and a notification is sent to a new service, the Reaper service, which is the preemptible instance orchestrator. It consumes this NoValidHost notification, and then two things can happen. If you are trying to boot an instance that is not preemptible, the orchestrator will try to delete a preemptible instance to make space for this new instance; that's what you see there, it deletes the preemptible instance, and the second step is to rebuild your initial instance that is in the PENDING state. If that's successful, your original instance rebuilds and its state changes to ACTIVE. If not, because there aren't any preemptible instances in the system, or because the flavor we are asking for doesn't fit even after deleting the preemptible instances, your original instance changes to the ERROR state. That's the workflow I just explained.

So we are doing a lot of work; we have a few specs upstream and some code. The first two bullet points there are two specs being reviewed, the addition of the notification and of the PENDING state, and you can already have a look at our prototype in our GitLab at CERN. If you have the same requirements for containers on bare metal, or you are doing similar work, please contact us, because we may collaborate on this. If you are interested in the preemptible instances, please also review the specs and give us your feedback; we are really interested to hear from you. So, thank you so much.
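Before the questions, here is a very rough sketch of what a Reaper-style orchestrator does when a non-preemptible request lands in the new PENDING state. This is not the CERN prototype (that lives in their GitLab); the project names, the candidate lookup and the random policy are simplified stand-ins.

```python
# Highly simplified sketch of a "Reaper"-style preemptible orchestrator.
# Not the CERN prototype: project names, lookups and the policy are stand-ins.
import random
import openstack

conn = openstack.connect(cloud='example')             # placeholder cloud name
PREEMPTIBLE_PROJECTS = ['preemptible-batch']          # dedicated project(s), huge quota


def preemptible_candidates():
    """All instances living in the dedicated preemptible projects."""
    ids = {conn.identity.find_project(name).id for name in PREEMPTIBLE_PROJECTS}
    return [s for s in conn.compute.servers(all_projects=True)
            if s.project_id in ids]


def handle_pending(server_id):
    """React to the notification for a non-preemptible instance stuck in PENDING."""
    victims = preemptible_candidates()
    if not victims:
        return False                                    # nothing to free: instance -> ERROR
    conn.compute.delete_server(random.choice(victims))  # simplest policy: pick at random
    # The orchestrator then asks Nova to rebuild the pending instance; if the
    # requested flavor still does not fit, the instance goes to ERROR instead.
    return True
```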
So we've got 12 minutes left for questions; there's a clock counting down, so let's fire away. We'll repeat the questions. So the question was: for preemptible instances, how do you pick which one to terminate?

You can have different policies, right? It could be the one that was created first, so the oldest ones are deleted first; it could be that some users have instances with higher priority than others. However, what we are doing now is the simplest case, so it will be completely random. It's a preemptible instance, the SLA is very low, you know it can be deleted at any time, so we're going to start by deleting randomly. But it's a very good question, and in the future what we expect is to add these policies, maybe as a plug-in where you can add your own policy (there's a rough sketch of what that could look like below).

Yes, in the link that I showed you for our prototype, yes, that is the idea initially. In the future we expect to change that, so that if you are a user with permission to do this, you can create spot instances yourself.

It's really just a pragmatic choice right now. There is a bonus to doing it that way, in that quotas are per project, so we don't actually need to worry about separate quotas for preemptibles versus other instances; you get that just by launching your instance in the other project. Network permissions aside, it seems like a good pragmatic choice. There's some work now on quotas per resource provider, and that will help us move towards everyone using preemptible instances in their own projects.

(We should remember to repeat the questions.) The question was: when you kill the instance, does it ever get restarted? Right now, on kill, it's basically just terminated. There's a forum session on preemptible instances (I don't like the name, I need a better name), and one of the things we were talking through was the relationship between OpenStack and the server when you delete it. What's likely to happen is we'll do a soft shutdown, give the VM or the bare metal node for that instance about 30 seconds to go, and then issue the hard shutdown if it hasn't shut itself down. So that's the kind of relationship; there's no coming back from the dead in that sense, it's actually just killing the server.

Oh, there's a question over there. So the question was what happens with the quota (that's a bad summary of the question), but when you start the preemptible instances, because they're in a separate project, the quota being used is the quota in that separate project. The projects you're launching the preemptibles in could almost have unlimited quota, if you're okay with that, but basically the quota in that preemptible project is effectively limiting the concurrent number of preemptibles you can have, if that makes sense. It would be a great addition to have preemptible instances in your normal project, but for now, to make it easy to have a prototype, a proof of concept, it's easier with a separate project. Yeah, I mean, there are a lot of different open questions all the way through this.
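The pluggable selection policy mentioned above might look something like the following. This is purely illustrative and is not the prototype's actual interface; the policy names and the `created_at` heuristic are assumptions.

```python
# Illustrative sketch of pluggable victim-selection policies for the Reaper;
# the current prototype simply deletes a random preemptible instance.
import random


def random_victim(candidates):
    """Today's behaviour: every preemptible instance is equally likely to go."""
    return random.choice(candidates)


def oldest_first(candidates):
    """A possible future policy: reclaim the longest-running preemptible first."""
    return min(candidates, key=lambda server: server.created_at)


POLICIES = {'random': random_victim, 'oldest-first': oldest_first}


def select_victim(candidates, policy='random'):
    # Only preemptible instances are ever candidates here; regular,
    # quota-backed instances never get preempted.
    return POLICIES[policy](candidates)
```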
So I think the real plan here is: let's get something working end-to-end. Right now it's really just about the minimal integration and changes in Nova to make this whole thing work end-to-end, and once we get it into production, I think we'll see where the limitations and the problems really lie. Because the instances are in this particular project, we can actually use that project ID to get the information from placement and work out what the candidates are to be deleted; obviously, when you're trying to pick what to delete, you have to find out what all the preemptible instances in the system are. Right now there's an API that takes a list of project IDs, so we can do that. It's an intensive operation, because you're doing it for quite a few different projects, but it's doable. Once we've got all those pieces in place, we can then talk about how we optimize each of them, assuming it works out and people are happy with it.

Yeah, and it will be a great tool for getting your requirements in. As John already said, there's an etherpad for the forum session; if you can go there and put in your requirements, what you think the next steps should be, that would be great. The network ownership is the thing that I'm a bit concerned about; whenever we ask people whether that's a problem, we just get a general shrug, so, you know, it's good to know if that's not the case.

Yeah, I mean, there is still quota, I guess, so it doesn't have to be unlimited; you could just say this user can only ever launch 10. We've been talking about only one project, but you can actually have several projects for preemptibles, and you can give these preemptible projects a finite quota, so your users will not be able to go over the limit that you define. Right, so the quota is just a limit: say you can have seven instances in this project, then you can launch seven preemptibles, but if you launch the eighth preemptible it will fail. So you can only do so much.

One thing, when we were talking about this originally, one concern was: can I launch some instances that make other people's get killed, and then go again, and sort of just create havoc? The key thing to realize with this is that you don't preempt preemptible instances, if that makes sense. When you're trying to launch your preemptibles, they're basically looking for free space, and all the ones that are, for want of a better description, "paid for" using regular quota, they're the ones that can kill a preemptible. So if you have lots of paid-for quota and you're creating loads and deleting loads, you could kick out preemptibles, but that's the whole point of the system. It's not as attackable as it first seems, because of that split. Although we could be wrong, so, you know.

Oh cool, okay. So the question, going back to the SKA work: we've got InfiniBand there, and the question is, is it multi-tenant? So currently we're not doing multi-tenant InfiniBand there, because it's kind of a special-purpose thing, so we don't think that's a problem.
Now, one of the things we might find in the prototype is that, because of the chattiness between certain things, we might want to use partition keys to help with that. I don't know if that's really a thing or not; we'll find out. Generally speaking, we've been working with other customers that have been doing multi-tenant InfiniBand, and there have been some interesting conversations around how we do that with secure hosts and other things. We should definitely have that conversation, but yeah, there are some interesting options there.

Sean, did you have a question? So, Sean asked, with a glint in his eye: there's a pending state here, can I do fun things with it? The answer is yes, and that's actually intentional. The pieces we're adding to Nova aren't special-purpose for preemptibles as such. The basic idea is that when you hit every Nova operator's favourite error, NoValidHost, you go to the PENDING state rather than ERROR, and then somebody has to do something else. There are two ways you can get out of the PENDING state: you do the rebuild, if space has been found (and anyone could do that), or you do a reset-state to go back to ERROR, basically. A thing I thought about, in a similar way, is that you can actually use it to effectively queue things you just don't have room for right now. So as a user, if you had that turned on, you could have a whole load of things sitting in the PENDING state and just rebuild them when there's space for them, or you can have someone else do that; you can always use it as a queue, effectively. So yeah, there's some interesting stuff you could do. The notification, for context, that we're waiting on is just a regular Nova notification that anyone on the notification bus can wait on. We had some discussions about that in the forum; it's probably changing to be the select_destinations notification, which hasn't been converted to a versioned notification yet, so when we do that we can get the information we need into it.

So the question was: how do we expose storage to containers running on bare metal? I didn't go into too many details on that. One of the ways we're adding the Ceph storage in is that we actually use Manila to manage CephFS, and we've got Ansible scripts that scrape Manila's API to get the mount point, so we're just doing a regular mount inside the bare metal instance to bring the storage in. Once you've got those mounts on your host system, it's just regular, whatever your COE is; right now we're targeting Swarm more than Kubernetes, but both are supported, and it's just a regular mount into your container as appropriate. Magnum's not really involved in this, at least in the current way we're doing it.

So the big countdown clock is almost at the red light, so I think that's probably the end of the questions. Thank you everyone very much. Thank you.