Hi, everybody. Welcome to our talk about how the CKI team, which I'm part of, is using spot instances. The title was made back when I thought I'd give an introduction to the many ways of launching spot instances. But when I actually started to prepare the slides, it turned out that there are not that many ways to launch spot instances correctly. So the title might be slightly misleading, but maybe it's interesting anyway. I'm going to talk a bit about why our kernel CI environment uses spot instances, what it looks like when they fail in various interesting ways, then, because we really didn't know anything about them, some of the things we learned while investigating the underlying issues, and finally what we did to improve things so they don't fail anymore.

So let's start with an introduction. I'm part of the Continuous Kernel Integration team, which provides CI pipelines as a service to the internal and external kernel development community. The high-level goal is that we try to prevent bugs from entering the upstream kernel by testing as early as possible, but a lot of our time is actually spent testing during the real development process. Nowadays kernel developers sit at gitlab.com, the internal ones, not the upstream ones. The internal kernel developers sit at gitlab.com, they file merge requests, and they get pipelines. And CKI is responsible for keeping track of that system and making sure it runs. I linked the project documentation; all our code is on gitlab.com, and it's open, so you can take a look and send merge requests if you want to contribute anything.

Our main job, as we figured out after a while, is running this service, and there are two things we mainly do. First, we make sure the kernel for these merge requests gets built, and we use spot instances for that; that's what this talk is about. Second, we test the kernel from these merge requests, or the RPMs that finally get built, depending on what the changes are, trying to gate stuff so that we don't break composes, those kinds of things. And I know there's room for improvement there, especially around breaking composes.

For the building part, this breaks down to building all four supported internal architectures: x86, ARM, PowerPC, and s390x. That comes down to about 300 hours a day of pipeline time spent building draft versions in merge requests, which then also get handed to QE for testing, those kinds of things. And because this is the kernel, these are pretty beefy instances: 16 CPUs, 32 gigs of RAM, and a local SSD for all the temporary storage during the RPM build. These things are not cheap; they cost about $0.70 an hour in general. So we try to optimize that, and that's why we use spot instances here.

At the bottom, you see a plot of the build hours. You see this nice pattern over the last couple of months showing that kernel developers are actually doing stuff, like proposing merge requests, and the workflow is in a place where they really use it. Before, they would send a patch to a mailing list, they would be required to build it locally, they would send a scratch build to Brew to get it built.
But nowadays it's basically git push, and boom, there comes a pipeline. And this is what the pipeline looks like. This is the general CKI pipeline. On the left, in the prepare stage, you get the usual things you would expect: there's some merging going on where the merge request is merged into the target branch, so you get a merged-results pipeline. Then it gets built, which is the interesting part I'm going to talk about. We publish repo files so QE can actually test these things. And then we also test it ourselves, so there's automated testing going on in Beaker: a couple of hours on all of these architectures, doing short boots, LTP, and if it's a storage change, a storage test, those kinds of things. So there is automated testing. It could all be better, but it's also not bad.

The way this actually works for the building part is that we use native GitLab CI pipelines. I'm not sure how many of you have used GitLab CI? A couple? Yeah. How many have used GitHub Actions? About the same. Okay, we are getting there. The way this works on the GitLab side is that there's a job, which is basically an expanded shell script. It gets handed to a piece called GitLab Runner, which is a service running on some machine. What this thing normally does is spin up Docker containers somewhere, let's say on the host, and then it executes the job script in such a container. That's basically it. There's magic involved, but this is basically it.

But if you spin up local Docker containers, you are constrained by the host. So if you want dynamic scaling by using something like OpenStack or a hyperscaler, there's a piece involved called Docker Machine, which is an interesting piece of software that's abandoned upstream. It was developed by the Docker folks. It creates a VM on a hyperscaler, basically just calls the API; I don't know whether any of you were at the Kube talk before, but they gave a nice list of what this involves, and it's not a lot. Then it installs Docker on this newly created machine, forwards the Docker port of the Docker daemon on this VM to the local machine where GitLab Runner runs, and then GitLab Runner talks to the remote Docker daemon over this socket. That's all there is. And there's basically no difference between a Docker container running locally and a Docker container running on this newly created VM. I find this a pretty simple, nicely working system. Now, it was not the direction the Docker folks wanted to take the whole thing, so that's basically why it's abandoned. GitLab forked it to be able to give it updates.

And if you drill down into the Go source code of all this, at the end there's basically one API call, which is called RequestSpotInstances. The way this is normally configured is that you give it an instance type and you give it a data center where you want your machine created. You basically say: give me this machine in this data center, and boom, there it comes. That's all there is.
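To make that concrete, here is a minimal sketch of what such a request boils down to. This uses Python and boto3 for brevity, not the Go SDK the actual Docker Machine code goes through, and the AMI, key pair, instance type, and availability zone are hypothetical placeholders:

```python
# Sketch of the single, rigid call described above: exactly one instance
# type in exactly one availability zone, take it or leave it.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",               # hypothetical AMI
        "InstanceType": "m5d.4xlarge",                    # one hard-coded type
        "KeyName": "runner-key",                          # hypothetical key pair
        "Placement": {"AvailabilityZone": "us-east-1a"},  # one data center
    },
)
print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])
```

If that one instance type happens to be sold out in that one data center, there is nothing AWS can do with such a request, which is exactly the failure mode this talk is about.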
By the way, before we used AWS for spot instances, we used an OpenShift cluster. Red Hat has some interesting, huge OpenShift clusters internally that have huge nodes, PSI OpenShift, people might know it. They have nodes with 128 cores, so you can spin up these containers on an OpenShift cluster as well. But a cluster is in some way static: you have a certain number of nodes, they might scale up, but they normally don't scale down. And if it's a bare metal cluster, these nodes are there or they are not. So switching to something like AWS provides a far more efficient environment where you only get these machines when you need them. When the American kernel developers wake up, machines get spun up; at night, nobody normally runs any of these pipelines.

The question was: are we using cross-compilation or native compilation? It depends. We are doing native compiles on ARM64 and x86, nowadays on AWS, so we're using the native machines there. For s390x and PowerPC, we are mostly cross-compiling, and then natively compiling some pieces that don't want to be cross-compiled; there are some interesting aspects to BPF, for example. Some pieces just don't like being cross-compiled. The official RPMs built on Brew are always built natively. But for our CI pipelines, the whole point is providing results fast that are good enough. There have been discussions about where this breaks down, but nearly all the cross-compilers work and are supported by the compiler team, so this is actually a supported workflow for this case. For the official RPMs, though, everything is compiled natively.

So this worked nicely, and we were really proud of ourselves. We had moved off the OpenShift clusters that were failing once in a while and had scalability issues, we had moved to AWS, and it's stable and it's a hyperscaler and it's all dandy, right? That was two years ago. And it worked nicely until the end of last year. Right before the holiday season, at the beginning of December, our jobs started to fail.

This is what it looks like when a GitLab job fails. Normally there's stuff up here; the log starts at line 104, so there are further things up there that actually worked. And at a certain moment you get a message telling you: I cannot connect to the Docker daemon running on this VM. And the reason it can't connect is that the VM shut down. If it shuts down, there's no Docker daemon, so the job fails. Now, these are jobs that take half an hour. So if you have this thing running for 20 minutes and then it shuts down, you repeat the job, and we do repeat it automatically, but you've still wasted 20 minutes on a machine that costs 70 cents an hour. And if you do this in parallel, which is what we're going for, it adds up. Especially for the kernel developers, it affects their workflow: these failures are annoying at best, and they're not just annoying, they really break the process.

So this is the first of December, somewhere around noon. Americans tend to wake up around that time and do a git push or something on their merge request, and then it just explodes: within 15 minutes, huge loads of jobs just fail. We restart them continuously on an exponential scale: directly, then after 10 minutes, then after 30 minutes. So a job at the 10-minute mark has already tried a couple of times and just can't get anything done. This is what started to happen at the beginning of December, when everybody was going on PTO already. The statistics said 30% of the jobs failing on a daily basis, but what it actually meant was that during the American workday, nothing would compile.
And so we started to dig into it. If you look at the logs on the runners, you find two error messages. The first one is unfulfillable capacity, where AWS helpfully tells you that there's not enough capacity available to match your request. This is what you get for a spot request, and it doesn't matter what you put in as a price: first they take the instances away from you, and then they prevent you from spinning up more. I don't know whether this sounds familiar to the Image Builder team, but this is what you normally see when spot instances are not available anymore.

But then we thought: we have a problem with spot instances, so let's just throw money at it. It's Christmas, holiday season, we had some cloud budget left, and we switched to on-demand instances, which are the normal ones. And then, helpfully, AWS came back with a different error message. In this case it was InsufficientInstanceCapacity, and it was also the first time we'd seen that one. It's actually more specific: it tells you the instance type you asked for is not available in this data center, this availability zone. And then it lies to you, because it tells you they are working on increasing capacity. They are not, right? Obviously not. So that was not helpful. Then we kind of panicked, started switching instance types around, and recovered a bit in panic mode. And that was the moment we figured out that we just didn't know anything about spot instances and how this all worked. Because this was not supposed to happen, right? It's a hyperscaler: you ask, you get. It's capitalism, that's how it works: money flows in some direction, and supply follows. That's just how the world works, on this side of the world at least.
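To pin down what we were seeing: the InsufficientInstanceCapacity code above is what on-demand requests fail with. As a hedged illustration, here is roughly how it surfaces if you make the same kind of single-type, single-zone request directly with boto3; the call shape is standard, the IDs are hypothetical:

```python
# Sketch: a single-type, single-zone on-demand request and the capacity
# error it can raise. Spot requests fail with their own capacity message
# ("unfulfillable capacity") instead of raising this code.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

try:
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",               # hypothetical AMI
        InstanceType="m5d.4xlarge",                    # one specific type...
        Placement={"AvailabilityZone": "us-east-1a"},  # ...in one data center
        MinCount=1,
        MaxCount=1,
    )
except ClientError as err:
    # "InsufficientInstanceCapacity": this type is sold out in this
    # availability zone right now, no matter what you are willing to pay.
    print("EC2 refused the request:", err.response["Error"]["Code"])
```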
So what is this actually? What is the cloud, what's a hyperscaler? If you ask Wikipedia, it tells you that hyperscale is the ability to scale with demand. Now, obviously that's a lie, because it did not: we had demand, and the hyperscaler was not able to deliver. Going into the specifics of the AWS cloud and looking into various sources on the internet, you find that they have about 30 regions; let's say cities, something like that. Each region has a couple of data centers, the availability zones; in most cases, an availability zone is a data center. And each of these data centers has a lot of servers, I think above 50,000 per data center, and these are big machines. I'm not sure. David Duncan is not here, but I've seen him around; I feared he would be here and would just tell me I'm full of it. So that's my interpretation; I haven't talked to anybody on the AWS side.

So what is AWS EC2 then? That's AWS renting you compute. You can go there and say, I want a server, and they give it to you. And depending on how much money you want to spend, you can get different guarantees from them. You can rent a machine on the go, paying per minute: that's on-demand. You can get a spot instance, which is what we use. These are cheaper for various reasons, but the main reason is that AWS can take them away from you again whenever it decides, for whatever reason, that it needs to.

You can also go further, to savings plans and reserved instances, where you commit to a certain amount of compute, or a certain amount of compute per instance type. That gives AWS some way of figuring out how much demand there is normally going to be. And we felt good about ourselves until this happened, because we had followed the recommended strategy: steady workloads get reserved instances and savings plans, interruptible workloads get spot instances, and the rest, the bursty stuff, runs on demand. So that's what we did, and it didn't help. We read a lot of stuff, and we still didn't know what was actually going on.

On the word "interruptible": something new to us was that AWS actually tells you a bit more about what that means. You get a spot instance, and it can be interrupted; you get two minutes, which is good enough for a shutdown but not much else. But they also tell you the chance that your instance gets terminated. There's a helpful table, and there's an API call where you can actually ask them: if I ask for this instance type, how likely is it that you take it away from me within a couple of hours? Normally this is below 5%, so basically fine. But it goes up to 30% and beyond. So if you ask for the wrong stuff, there's actually a very high chance that it terminates within five minutes, ten minutes, a minute; it might just be effectively impossible to use as a spot instance.

And then there's the core problem of the whole thing. There are real data centers somewhere. This is not "the cloud"; it's a data center, and it has servers, and it has cooling and power. These instance families are real machines, real hypervisors sitting somewhere, and there's a certain number of them. They don't scale; there's no scaling a rack, it's just there or it's not. And to give you an instance when you request one, it needs to be there. If you request an on-demand instance, it needs to be there in this data center, and they can only give it to you if it's not used by somebody else. Spot instances are kind of a buffer: the range of machines that are available but that nobody is using yet. So basically you can use them until somebody else wants them. But obviously there comes the moment when there's just nothing left, and then even on-demand requests fail.

And spot instances actually change in price, which is the plot at the bottom. This is the last three months; I was too late with making my slides, so I don't have plots from the time when this all happened. You can see the black line, which is the price of one instance type, pretty cheap, and you can see how the spot prices, the colored lines at the bottom, actually go up over time. Different availability zones, different data centers, and they all follow roughly the same trajectory. So somebody was asking for spot instances, and they got more expensive. But what we did took none of that into account: we just asked for this one specific instance type in this one specific data center. So there was actually nothing AWS could have done to help us. We just asked for the server, and there was none.
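The data behind a plot like this is available to anybody through the EC2 API. A small sketch of pulling it, again with boto3; the instance type is a hypothetical placeholder:

```python
# Sketch: fetch 90 days of spot price history for one instance type across
# all availability zones in a region, the raw data for a plot like the slide's.
import datetime
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
start = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=90)

for page in ec2.get_paginator("describe_spot_price_history").paginate(
    InstanceTypes=["m5d.4xlarge"],        # hypothetical instance type
    ProductDescriptions=["Linux/UNIX"],
    StartTime=start,
):
    for point in page["SpotPriceHistory"]:
        print(point["AvailabilityZone"], point["Timestamp"], point["SpotPrice"])
```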
Coming back to those error messages: this is basically what they mean. We don't have these machines available. And yeah, okay, a hyperscaler is also just bound to reality. So we were thinking about what to do. We are engineers; what do we do when we have a problem? We add another script, more tooling, whatever. Our idea was to find out: how can we specify multiple instance types? How can we specify multiple data centers? Can we maybe poll these APIs and figure out what to do, maybe build something automatic that switches things around? And then somebody not known to the world of engineering suggested we should actually read the documentation. Crazy, right? That's just not what normally happens; you don't just read documentation. But okay, we did read it.

So we read the documentation, and there are about five different ways to create instances in AWS. Historical reasons, most likely. They are in the table. The most well-known one is RunInstances, which we saw in the Kube talk before; that's the one you find all over the place. And it's not recommended, right? Obviously not. Then there are two that have "spot" in their name, RequestSpotInstances and RequestSpotFleet. They give you spot instances; RunInstances can also give you spot instances, but these two, they really just give you spot instances. And you're also not supposed to use them. AWS tells you: don't use these. They are named like that, but you should not use them. What they tell you to use are the two at the bottom. One is called CreateFleet, which sounds like, I don't know, ocean warfare or something. And the other one is CreateAutoScalingGroup. I don't want a group, and it shouldn't auto-scale; I just want a machine. But those are the two they say to use. Which we obviously did not. So we figured it was our problem; well, we knew that before.

In the middle of the table, reading a bit of the documentation about them and looking into what we actually wanted to do, which was specifying multiple instance types and multiple data centers: the three at the bottom actually give you that. They allow you to specify this in the API call, so you don't need any custom tooling. It's a sad world, right? You don't get to write code, you just need to change which API call you make. But okay. So two rows still look good. And if you look at the purpose we have, spinning up one machine temporarily and then shutting it down again, it depends a bit on how hard they are to use. And CreateFleet is actually the one you want to use. You might not know that; we did not, and the internet also doesn't know. I don't know what ChatGPT would tell you. Somebody could ask, it would be something to do if you're bored: ask ChatGPT what it recommends for creating an instance on AWS. Would be interesting. Maybe it has caught on, but maybe not.

So we used RequestSpotInstances; we should have used CreateFleet. That's what we would want to use. There are a couple of differences, and I'll just mention them. One is that RequestSpotInstances is simple: you give it something you want to launch, four CPUs, 16 gigs of RAM, and it will give it to you.
It can also use something called a launch template, which is a template where you specify things once so you can reuse them and don't have to spell them out over and over again. That's basically all there is to RequestSpotInstances. CreateFleet has a couple more cool features. One of them is that you can vary this launch template: you can say, maybe do it in data center A, B, or C, your call, AWS. You can be more generic with instance requirements: you don't have to say "give me this specific instance type", this magic name that nobody can decode; you can actually tell it "give me something with four CPUs or more", and it will do that. And the last part is the allocation strategy, where you can tell AWS: when looking for an instance, please pick one that is cheap and available, highly available, something that will not get killed. So CreateFleet is better. AWS is right about this, like they are about most things, most of the time.

And it's not hard; it's actually not hard. The sad thing about the whole story is that it's actually simple to fix. If you want to create a launch template, which we previously did not use, you specify this YAML file, convert it to JSON, and feed it into the API call at the bottom. There's an image ID, where you give it the name of the image you want to boot, an SSH key, and then you say: we want 16 CPUs and 32 gigs of RAM here, please don't use the really, really old instance generations that don't boot RHEL anymore, and put a local SSD on those things. And that's it; that creates the template. And now, if you look into it, there are actually 11 different instance types matching these requirements. We asked for one, but AWS says: we have 11 of those. We just didn't know.

And the second part is that you now create the instance using CreateFleet. You say: I want an instance, an "instant" instance, like now, please. It should be a spot instance, and spawn it into one of these subnets that are available. And that's it. You do this, you get an instance ID back, and it all works. It is that simple.
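Putting the two steps together, here is a hedged sketch of that fix, in Python with boto3 rather than the Go that went into the actual Docker Machine fork. The template name, AMI, key pair, and subnet IDs are hypothetical, but the requirements mirror the ones just described: 16 CPUs, 32 gigs of RAM, a local SSD, and no ancient instance generations.

```python
# Sketch of the two-step CreateFleet approach described above.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Step 1: a launch template with attribute-based instance requirements
# instead of one magic instance type.
ec2.create_launch_template(
    LaunchTemplateName="kernel-builder",            # hypothetical name
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",         # hypothetical AMI
        "KeyName": "runner-key",                    # hypothetical SSH key pair
        "InstanceRequirements": {
            "VCpuCount": {"Min": 16, "Max": 16},
            "MemoryMiB": {"Min": 32768, "Max": 32768},
            "InstanceGenerations": ["current"],     # no really old instances
            "LocalStorage": "required",             # must have a local disk
            "LocalStorageTypes": ["ssd"],           # and it must be an SSD
        },
    },
)

# Step 2: one spot instance, right now, in whichever subnet (data center)
# and matching instance type is cheap and unlikely to be reclaimed.
response = ec2.create_fleet(
    Type="instant",                                 # synchronous one-shot launch
    TargetCapacitySpecification={
        "TotalTargetCapacity": 1,
        "DefaultTargetCapacityType": "spot",
    },
    SpotOptions={"AllocationStrategy": "price-capacity-optimized"},
    LaunchTemplateConfigs=[{
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "kernel-builder",
            "Version": "$Latest",
        },
        "Overrides": [
            {"SubnetId": "subnet-0aaaaaaaaaaaaaaaa"},  # hypothetical subnets,
            {"SubnetId": "subnet-0bbbbbbbbbbbbbbbb"},  # one per data center
        ],
    }],
)
print(response["Instances"][0]["InstanceIds"][0])
```

The "instant" fleet type is what makes this fit the one-machine-per-CI-job use case: the call returns the launched instance ID directly, and AWS does the choosing across instance types and data centers via the price-capacity-optimized allocation strategy.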
So this was kind of amazing, in a way. Now, the problem was, yeah, we didn't read the documentation, obviously. We still don't read the documentation, but we know a bit more about the AWS API now, and they keep doing stuff to it: things like these allocation strategies are pretty new. Use CreateFleet, I can only implore you to do this, or use auto-scaling groups. This price-capacity-optimized strategy is basically what you want to use: something that will not get killed and is cheap. The problem is that nobody uses CreateFleet right now. It's not very common yet, because everybody gets told to use RunInstances. And so the thing we use, GitLab Runner, uses Docker Machine, which does not support this, because it's abandoned upstream. So what we did in the weeks before the holidays last year was basically implement support for that. We forked it, I learned Go, which was interesting, we put this in there, and since then we've never had any trouble with spot instances again. So there's a happy ending to the story of CKI versus the failing spot instances. It all worked out. It was an interesting exploration, and I thought that was a good reason to share it with you. So that's it. Do we have any questions?

Yeah, so the question is: did using this new API call actually result in cost savings? Not really that much, because these instances are mostly similarly expensive. But it was a bit cheaper than retrying the same jobs over and over again and failing over and over again, where we were just wasting a lot of money trying to get stuff done but eventually not getting it done. The instances we get now might actually be more expensive, but because we only run each job once in total, I think it's similarly expensive as before, just far more reliable.

So the question is: are you able to restrict regions? Yes, these API calls are normally very specific to a region, so you actually say "in this region", and then in one of its data centers; but this stays within a city, basically. In the Amazon world, a region is, let's say, a city, and an availability zone is something like a data center.

So the question is: is it possible to say "I don't care"? No, it's not possible, and you also don't want that, because one of the most expensive things in AWS is data transfer between various things. It's mostly free if you stay within a region, especially S3: if it's all in one region, it's all good, but as soon as you move any of it out, it gets really expensive.

So the question is: how does this eventually interact with Beaker, where we are going to test these kernels, given that Beaker machines come and go? We build the kernels in AWS, and we push them mostly to S3 buckets in AWS. Then, because transfer is expensive, transferring into the Red Hat network is expensive, we only do it once: we copy them into an internal S3 bucket. Beaker then boots off a compose, something internal, adds the repository with these custom kernels, installs one, boots it, checks that it's actually the one that booted, and then the testing proceeds. The kernels are just normal RPMs, and this is just a regular repo; we tell it "test this kernel", and we make sure it's actually the one that booted, because that's kind of interesting.

So the question is whether we tried to limit the number of jobs that would get spawned. Yes, we tried, but very quickly it got to the point where it would just not spawn any instances at all. It was a single data center, also used by others, and our load did not really increase over that time period; some other pesky customers came in and took our instances. So we were actually at the point where we could not spawn any of these machines. Our theory is that it's because we require these local SSDs: we think there's a rack and a certain number of SSDs that can be attached, and there are far fewer SSDs than VMs you can put on there. So if there's no SSD left, there might be compute, but you don't get an instance.

So the question is: did we observe any intermittent patterns, when somebody else needs spot instances?
Not really. We did not really figure out what was actually going on at the time. But something unrelated: in the last couple of months, since about March, spot instances have gotten far more expensive than before. They used to settle around 30 to 40% of the on-demand price, and now, in a lot of regions and a lot of data centers, the prices have actually approached the on-demand prices. People have looked into this, and it corresponds with a higher rate of interruption, so it's not just Amazon being greedy, which might be something you would think. It seems that, maybe because of cost-saving efforts, people are moving more to spot instances, so there's more demand for spot instances, and I think Amazon's algorithm tries to provide a signal so people know: don't use those types, we don't have enough of them. I think something like that is happening in the background. But if you don't use CreateFleet, you don't react to that price indicator. I think AWS wants you to use these API calls so they can distribute load across data centers and across instance types, but nobody uses them, so it doesn't work for them, or not well.

So the question is: did we actually look into multiple regions? Yes, we thought about it. We thought about multiple availability zones first: the networks we spawn into, some of them have internet connections that are limited to certain data centers at the moment, so spawning across multiple data centers is one of the things we would still want to enable. Spawning across multiple regions is interesting because, as I said, data transfer costs actually factor in quite a bit. We would need S3 buckets in those other regions, there's a transit gateway into the Red Hat intranet that would need to be provided there, and we also have an OpenShift cluster that does a lot of the magic behind what goes into the data, and that also runs in the same region. Around data transfer costs, this design kind of makes you not care, which is bad in some way, because it's cheap: these jobs pull gigabytes of cache files from S3 for free. It's a very efficient way, it's cloud native, whatever. But if you wanted to split this across regions or move to another region, it would get prohibitively expensive. So we thought about it, but we didn't see any way to do it yet.

So the final remark is that they've seen the same problem; I don't know which team you are on, but other people have seen it at the same moment in time, in us-east-1, which is the most popular region. We are not planning to move to other regions, so you're free to move there; I know that Red Hat has internet connections in other regions as well. If you have better experiences there, maybe we'll follow you. Maybe. And the other remark is that it might not actually be related to the SSDs, because they haven't used SSDs on their instances and the problem was very much the same. Okay, thank you very much.