My name is Kyle. I'm from Yelp. I'm an SRE. I'm here to share with you how Yelp runs our Mesos clusters on Amazon Spot Fleet. We do save a lot of money by doing this, but it comes with certain risks and requires certain engineering effort to make it work, and that's what I'm going to be sharing with you. Before I get too far into this, can I have a show of hands of those in the room who have heard of and know what Spot or Spot Fleet is in the Amazon offerings? Wow, that's most of you. Okay, for the camera, almost everyone raised their hand. That's a good sign. I won't have to define too many terms, but I'll go through the quick explanations regardless. For those in the room who are interested in running their production workloads on Spot Fleet, this should be a gold mine of best practices, I hope. And of course, afterwards, I'm happy to share more details about what's going on behind the scenes if that can help you run your workloads on Spot Fleet. If you haven't heard of Yelp, Yelp is an app and a website that helps connect people with great local businesses. It can help you find maybe a good restaurant to eat at after your MesosCon, or maybe help you get a quote from a plumber to fix a leak, something like that. First, I'm going to go through what Spot and Spot Fleet are really fast, then I'm going to describe how we manage Spot Fleet at Yelp, and then share some of the best practices we've learned over the years. And then I'll show you the graphs about why we'd go through all this work: how much money we actually save in real life. Spot Fleet, definitions. I'm not going to spend all the time defining all these terms. There's a certain class of person who really likes it when people define exactly what they mean with stuff. Sorry, I don't have time. I'm glad that most people in the room already know what Spot Fleet is.
I'm not going to go through all these things, but it is important that when I talk about instance types, you know that these are the different classes of EC2 servers. And when I talk about availability zones, these are data-center-style separations within Amazon regions. I'm going to be referencing reserved instances. This is where you pay Amazon in advance, for a few years, to have a server that is reserved for you. And I'm also going to be talking about the on-demand price. This is the price you would pay if you went to the Amazon console right now, opened up a new account, and just launched a server. That's like the list price of a class of server. So let's do a quick demo of what spot instances are and how they work. Spot instances are servers that you bid on, and the supply comes from all of the extra servers that Amazon has that have been reserved but are not in use. These belong to customers who have paid Amazon in advance for these servers for a couple of years, but they're not using them at that second. So Amazon would like to recoup some of that value rather than leave them sitting idle. They're willing to sell them, but you can't have them forever. If the customer who reserved them wants them back, you get two minutes before you're evicted from that server and it goes back to the original customer. The way that Amazon sells these is with a bidding process. You can see on the left-hand side there are bids from some theoretical customers who are interested in buying these spot instances. And on the right-hand side we have the supply, the servers that are actually available in this pool of resources. The spot price is the most inexpensive bid that gets fulfilled. So in this case, the current spot price of this theoretical market is $3. Let's go forward in time and say that somebody had shut down a reserved instance, and now there's some new supply.
Customer C, who asked for two servers, now gets their second one fulfilled. Let's go forward in time. Let's imagine that somebody stopped another instance, and that freed up even more capacity. Now there's another Spot bidder who got fulfilled where they weren't being fulfilled before, and the price that everyone's paying has gone down. It's gone down to $2. So even though everybody's bidding high amounts, the price that they pay is $2. That's cool. Uh-oh, everybody recalled their servers kind of all at once. Maybe a customer got some more load and increased the capacity in their auto-scaling group. These customers now get out-bid. And the only person who didn't get out-bid was bidding the highest, which is $5. So the current spot price is $5, and these other customers had to be evicted from their servers in two minutes. That's the quick overview of what spot instances are and how they work. The price can go higher than the on-demand price. I think the hard-coded limit might be something astronomical, but you can bid even higher than the on-demand price to decrease the likelihood of getting out-bid. So that's the quick overview of what spot instances are. Spot Fleet. Spot Fleet is an offering from Amazon that gives you a method to launch spot instances, where the Spot Fleet service launches instances on your behalf. It can respond to out-bid events by launching alternative instances and otherwise tries to keep your overall capacity up. So when you get out-bid, it'll launch a different instance in its place. Let's quickly look at how that goes. Let's say that you were out-bid on a particular 4XL. Amazon Spot Fleet might launch two 2XLs in its place because they're roughly equivalent; you let Amazon know what's close enough. Let's say that you got out-bid on all the 1XLs across all the zones.
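The bidding walkthrough above is just a uniform-price auction: everyone who gets a server pays the lowest bid that was fulfilled. Here's a toy sketch of that mechanic; the customers, bids, and counts are invented to mirror the example on the slides, not real Amazon behavior:

```python
# Toy model of the spot market auction described above: bidders ask for
# capacity at a maximum price, and everyone fulfilled pays the lowest
# fulfilled bid. All names and numbers are illustrative.

def clear_market(bids, supply):
    """bids: list of (customer, max_price, count); supply: servers available.
    Returns (spot_price, allocations) where allocations maps customer -> count."""
    allocations = {}
    remaining = supply
    fulfilled_prices = []
    # Highest bidders win first.
    for customer, price, count in sorted(bids, key=lambda b: -b[1]):
        take = min(count, remaining)
        if take > 0:
            allocations[customer] = take
            fulfilled_prices.append(price)
            remaining -= take
    # Everyone pays the cheapest bid that was (even partially) fulfilled.
    spot_price = min(fulfilled_prices) if fulfilled_prices else None
    return spot_price, allocations

bids = [("A", 5, 1), ("B", 4, 1), ("C", 3, 2), ("D", 2, 1)]
print(clear_market(bids, supply=3))  # A and B fulfilled, C partially; price is 3
print(clear_market(bids, supply=5))  # D now fulfilled too; price drops to 2
```

Note how adding supply both fulfills a new bidder and lowers the price everyone pays, exactly as in the walkthrough.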
Maybe somebody just went on a shopping spree and wanted all the 1XLs. And you got out-bid. Amazon Spot Fleet might launch two 2XLs in an alternate zone if they were available. So that's Amazon Spot Fleet: the control system that Amazon provides to give you a bunch of spot instances while trying to maintain a certain amount of capacity. This is really important for running production workloads like Yelp's website, because we need capacity all the time. It's not like a batch workload where, if we got out-bid, it's okay, it'll finish eventually. The website can't stop. So we really need this. It's important that users are able to describe exactly what kinds of instances are suitable to run their workloads, and I'll show you examples of how we do that at Yelp. But it comes down to a big declaration of: I can run on C4 4XLs and 10XLs, and you describe all of these different instance types. The Spot Fleet service is what's doing the launching for you. This is really important. You don't have to be at your laptop launching new things when instances get out-bid; Spot Fleet is doing this for you. That's the value offering. Now, let's move on to how we manage spot fleets. Let's take a quick look at the web interface for managing spot fleets. If you go to Amazon's console right now and go to launch a spot fleet, it looks like this. I hope I don't have to sell everyone in the room on the idea that infrastructure as code is the way of the future, and that launching a spot fleet in this manner, although convenient and easy to get started with, isn't really a long-term solution for running a website on. If you're going to make changes, launch another spot fleet, or launch in a new location, you need it to be reproducible, and you don't want to have to write a big, long how-to document about which buttons you clicked. Can we do any better? You could certainly use the CLI.
Here is all of the text from the example spot fleet request from the CLI. If you look in the docs, this is the recommended way to launch a spot fleet request via the CLI. And I'll tell you right now, I personally find this unacceptable. It's too easy to read the docs, see that they say to do X, Y, Z, and stop there. I'd like those in the audience to expand your minds a little bit and ask yourselves: is this really the best way to be doing this? Maybe they only have this in the documentation because they only went this far. They don't expect you to do this a lot, something like that. But in practice, if you're going to be running a website on something, I don't want developers, or operations engineers, to have to worry about this. Could we do even better? Well, let me explain a little more about why I think this specifically is not a good idea. Let's take a snapshot of this JSON and look at it really closely. Don't worry about the actual contents, but all of these numbers: where are they coming from? Magic numbers, magic numbers: key pairs, security group IDs. Where did the subnet IDs come from? I had to look that up. Maybe they copy-pasted. There's another duplicated number. I don't want any of these. Oh, and the biggest problem is that this JSON file is very cumbersome to write. You'll see later that the key to making reliable production spot fleet requests is diversity, but to get diversity in this style means you have to make this JSON really, really big. That's really hard. So if you're making these things by hand, this style is forcing developers to do the wrong thing. It's not making it easy for them to do the right thing. So can we do even better? At Yelp, we use Terraform. I used to be kind of a Terraform fanboy, maybe, but I'm not anymore.
I really think that there's room for lots of tools. I'll just say this: there's a certain class of person who, when they see Terraform and look at how it works, thinks it's the best thing since sliced bread. I was definitely in this category. There are some people who look at Terraform and think, oh, wow, this is scary, I'd rather just use the CLI. I think that's cool too. So my new thing is I say to everybody: why don't you look at Terraform, just in case you're in that first class? Because if you are, seriously, your mind will be blown. You'll be like, wow, this is really the best way to do it. If you're not, that's okay. But I'm going to show you how Yelp does it, just because I'm on stage. At Yelp, we built this module in Terraform to help reduce the complexity of launching spot fleet requests. Again, the reason we do this is because we're running our website on it. We want to make it really easy to do the right thing. We want to be able to reproduce it in multiple regions. We want to be able to iterate on it and make it better over time. Spot Fleet has a lot of rough edges. I don't feel embarrassed to say that the Spot Fleet API is probably one of the most cumbersome Amazon APIs that I've ever worked with. It's very difficult to work with. So I don't necessarily want to hide it from our operations engineers, but I do want to make it as easy to work with as possible. That's why we use Terraform and that's why we use this provider. No magic numbers. The inputs have been reduced to only what you care about, and I'll show you: we have a JSON file that we can reuse instead of copy-pasting a lot. Here is how Yelp describes our spot fleet request instance types. You'll notice there's nothing in here about security groups, VPC IDs, or anything like that. It's only about the instance types. We've separated this part from all of the parts that are region-specific. What runtime environment?
Are you in dev or prod? What account number are you working with? By separating them, we can iterate on just the part that we're interested in: what kind of compute we need and how much we're willing to pay for it. And then we just reuse this file. It's that simple. So there's some good stuff in here. It's really easy to add to this file. And of course, that's making it easy for our operations engineers to do the right thing, which is to increase the diversity, which makes our spot fleets more reliable. And through the magic of Terraform, if you make a change, Terraform keeps it in sync. That's the big selling point behind Terraform: you declaratively write your infrastructure and Terraform makes it happen. Reduced duplicated data, and all sorts of good things about having infrastructure as code and using Terraform in this way. Now, why did I take five minutes explaining why we use Terraform for launching spot fleet requests? You could do everything without Terraform and everything would be fine. The reason is: you're going to get it wrong. We've gotten it wrong many, many, many times. We have to iterate. So we need to get really good at recreating the spot fleet request, because things are going to change. You're going to change your policy on how much you're willing to pay for things. You're going to add a new instance type. You're going to add a new AZ. You're going to launch a new Mesos cluster. You're going to upgrade your Mesos clusters. Lots and lots of things. So you want to get really good at it, and you do not want recreating the spot fleet request to be something that you're afraid of doing. You do not want recreating the spot fleet request to be something you're afraid is going to take down all of your infrastructure. You need confidence that you're going to be able to do this and do it well.
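To make the "small reusable file" idea concrete, here's a rough sketch of the kind of expansion a tool like our Terraform module performs: a compact, region-agnostic map of instance types and weights gets blown up into the full cross-product of launch specifications that the Spot Fleet API wants. The field names mirror the EC2 RequestSpotFleet API, but the instance types, subnet IDs, AMI, and key name are invented examples, not our real configuration:

```python
# Expand a compact instance-type map into the verbose LaunchSpecifications
# list that a spot fleet request needs: one entry per (type, subnet) pair.
# This is the part that's painful to maintain by hand in raw JSON.

def build_launch_specs(instance_types, subnets_by_az, ami, key_name):
    specs = []
    for itype, weight in instance_types.items():
        for az, subnet in subnets_by_az.items():
            specs.append({
                "InstanceType": itype,
                "WeightedCapacity": weight,
                "SubnetId": subnet,
                "ImageId": ami,
                "KeyName": key_name,
            })
    return specs

instance_types = {"c4.2xlarge": 2, "c4.4xlarge": 4, "m4.2xlarge": 2}
subnets_by_az = {"us-west-2a": "subnet-aaa", "us-west-2b": "subnet-bbb"}
specs = build_launch_specs(instance_types, subnets_by_az, "ami-123456", "ops")
print(len(specs))  # 3 types x 2 AZs = 6 spot markets covered
```

Adding one line to the compact map adds a whole column of spot markets, which is exactly the "make the right thing easy" property described above.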
So that's why I spent all this time explaining why we use Terraform. It gives us the confidence to be able to reproduce these spot fleet requests and not be afraid to adjust them. Let's take a look at what this looks like. Here is a spot fleet request before the big hump, and then this is somebody running terraform apply, launching a new spot fleet request. It comes up, the other one scales down, and afterwards it's been recreated. You can see it's slightly more diverse, which is good. This is an engineer improving our spot fleet requests without downtime, which is a big deal. At Yelp, this is really easy to do thanks to the tooling we built with Terraform. Let's talk about best practices next. If we're going to build a spot fleet request, how should we build it targeting production infrastructure, Yelp.com, things that need to be up all the time, not the batch workloads that Spot Fleet is traditionally, or at least by design, meant for? The number one thing you can do to make a production-ready spot fleet request and run infrastructure on it is diversification. Remember, when you get out-bid on a particular instance type, the Spot Fleet mechanism launches different instances in a different location. By diversifying, you're expanding the number of possible things that Spot Fleet could do on your behalf, and you're reducing the blast radius when you get out-bid. Imagine if you launched a spot fleet request and said: I can run in two AZs on C4 XLs. That would only be two spot markets. If you get out-bid, which you will, you're going to lose 50% of your cluster. That's no good. So diversification helps reduce that blast radius. It's just like buying a mutual fund or some other diversified investment. You might think that this is as easy as picking the diversified strategy over the lowest-price strategy in the Spot Fleet request API. That's certainly part of it.
You absolutely should pick the diversified strategy, but you need to do more. You need to expand the types of servers you're willing to run on. This means you might have to make sacrifices about the ideal instance type you'd like to use. At Yelp, for a long time, we used the C4 XL exclusively for our big web app. Sure, it gave the best performance, and what we could run on other instance types would be a little slower, but could we save a lot of money by giving Spot Fleet the freedom to pick different spot markets and instance types? We decided collectively, after running the numbers and seeing how our website performed on other instance types, that this cost-benefit trade was acceptable. That's what this really comes down to. It's a business decision. You're choosing a type of infrastructure that's less stable but very cost-effective, and you're making trade-offs between performance and reliability. Diversification: Amazon Spot Fleet does not retroactively try to rebalance your spot fleet to make it more diverse, even when you pick the diversified strategy. You need to be aware of this, because over time the spot fleet can become less diverse, and as it becomes less diverse, your risk is increasing. I'll show you graphs of that. A key fundamental point about how spot fleets are implemented is that spot markets are a combination of AZ and instance type. So it's not enough to launch everything in one AZ with a bunch of different instance types. You need to be spread out across all those markets for good diversification. And weighting. This is really important, because if you don't correctly weight which instance types are acceptable and you give the spot fleet the wrong kind of signals, it may inadvertently launch a lot of, say, M4 16XLs because you've weighted them really high.
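One way to avoid the weighting trap just described is to derive weights mechanically from capacity (for example, vCPU count) rather than hand-picking them, so a big instance type never looks artificially cheap per unit of capacity. This is a sketch of that convention, not our actual tooling; double-check vCPU counts against the EC2 docs:

```python
# Derive spot fleet weights from vCPU counts relative to a base type,
# so Spot Fleet values, e.g., one 4xlarge the same as two 2xlarges.
# vCPU numbers here match the usual EC2 published values, but verify them.

VCPUS = {"c4.2xlarge": 8, "c4.4xlarge": 16, "m4.10xlarge": 40}

def weights(base_type="c4.2xlarge"):
    base = VCPUS[base_type]
    return {itype: vcpus / base for itype, vcpus in VCPUS.items()}

print(weights())  # {'c4.2xlarge': 1.0, 'c4.4xlarge': 2.0, 'm4.10xlarge': 5.0}
```

With weights like these, "launch 100 units of capacity" means the same amount of compute no matter which markets Spot Fleet picks, so no single type can quietly dominate the fleet on a bad signal.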
This would be bad, because you want your weights to represent a diverse and evenly balanced fleet, which, again, reduces your risk if you're interested in running production workloads. Let's take a look at that diversity in action. On the left-hand side, there's some lines, but you can see that we're really hot on one particular instance type. This spot fleet request was not configured correctly. And then when we built the new one, it gets a little more evenly balanced. I have a better example next. Look at this spot fleet request. You can see that over the course of a couple of days, we made a mistake in our weighting and the spot fleet request has highly favored two particular instance types. We are at serious risk here. If we lost one of these instance types and got out-bid, we would lose half of our cluster. You can see towards the end of the week here we've corrected this. And look how nice and rainbow-esque the last part is. That's because our weights are now correct, we're more evenly balanced, and there's no hot spot in our spot fleet request. Pretty cool. This is how you do it: you add more entries to this JSON file if you're using the Terraform provider. Of course, if you're using the CLI, you're going to be adding lots and lots of stuff. But how do you achieve diversity? Through conscious and deliberate effort, tuning, and deciding which instances are correct for your workload. Let's talk about AZ balancing. Most people on Amazon who are running these kinds of things, websites, are using auto-scaling groups, where Amazon launches instances, usually of one type, and you click the force-AZ-balance button, which makes Amazon launch equal numbers in the different AZs. Spot Fleet does not do this. It's not really designed to run websites on. It's really designed for batch workloads and that kind of thing. You have to take it into your own hands to balance these AZs yourself.
Well, how can you do that? Let's look at another graph. This is a spot fleet request. It looks pretty well balanced, but let's look a little closer. Watch what happens when I look at it on a per-AZ basis. I take out all the instance types; this is our capacity per AZ. Still not bad, but then when I unstack it, you can see that at times there are some serious imbalances here. Not only is this a little risky, but it also means that some of our developers, in particular our database engineers and our memcache clusters, can get hotspots. How do you eliminate this? One way is by forcing Spot Fleet to only launch things in one AZ: instead of launching one mega spot fleet, if there are three AZs, you launch three different spot fleet requests, each pinned to an AZ. So when you get out-bid, Amazon is forced to launch more capacity in the same AZ, because the spot fleet request is self-contained. It has to launch new instances of a different type in that AZ in order to meet your capacity, and it's not allowed to move capacity to a different AZ just because it might be cheaper. This is definitely a trade-off. There may be instances in other AZs that are less expensive, where Spot Fleet could save you money. But if you care about AZ balance, and you care about diversification, and you want to make sure that you don't end up piled onto one AZ, which, again, is a serious risk, you need to set up AZ-specific spot fleet requests, hard boundaries that make sure Spot Fleet can't drift out of balance. Let's talk about bidding. If you're interested in launching spot fleet requests for a website, you're going to have to choose what kind of bids you make. At first glance, you might think you should just bid high, because you don't want to get out-bid. You want your website to stay up.
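The per-AZ pinning described above amounts to building N self-contained fleet requests instead of one big one. Here's a rough sketch of that shape, with made-up subnet IDs and a placeholder for the launch-spec builder; the config keys mirror the Spot Fleet API, but this is an illustration, not our production code:

```python
# Build one spot fleet request config per AZ, each restricted to subnets
# in that AZ. Replacement capacity after an out-bid must then land in the
# same zone, keeping the AZs balanced by construction.

def per_az_fleet_configs(total_capacity, subnets_by_az, launch_specs_for):
    configs = []
    share = total_capacity // len(subnets_by_az)
    for az, subnet in sorted(subnets_by_az.items()):
        configs.append({
            "TargetCapacity": share,
            "AllocationStrategy": "diversified",
            # Only subnets in this AZ: Spot Fleet cannot move capacity away.
            "LaunchSpecifications": launch_specs_for(subnet),
        })
    return configs

subnets = {"us-west-2a": "subnet-aaa", "us-west-2b": "subnet-bbb",
           "us-west-2c": "subnet-ccc"}
configs = per_az_fleet_configs(
    300, subnets, lambda s: [{"SubnetId": s, "InstanceType": "c4.4xlarge"}])
print([c["TargetCapacity"] for c in configs])  # [100, 100, 100]
```

Each config would then be submitted as its own RequestSpotFleet call, giving you three independent fleets that can diversify across instance types but never across zones.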
You can't bid infinitely high, but you do want to bid high, and you certainly want to win over the batch workloads that other people are launching on Spot. At Yelp, we've concluded that 2x is pretty good. You want to survive the small spikes in the price, but you don't want the bid to go so high that you end up paying tons and tons of money when you should have just been launching things on-demand the whole time. So remember, this is a business thing. We want to save money, and if we choose the wrong bid, we won't save money, and then what's the point? How did we come to that conclusion? Well, we wrote a tool to scrape the pricing history and then analyze: how many times would we get out-bid with a certain bid price? And conversely, how reliable would a piece of infrastructure be if we bid X? By using this tool, we were able to explore the space and come up with general guidelines about how much we should be bidding on our spot fleet requests. Here's an example. Basically, we asked the tool: how much would we have to bid to make this class of server 100% up over the course of the month? And it spit out a recommendation that we bid $10 an hour for an M4 10XL, whatever. The point is, by inspecting the data, we're able to prove to ourselves, or at least gain some confidence, that if we bid high, we really will stay up. Similarly, in this example with a C4 4XL, if we were to bid the recommended amount that this tool says would keep us 100% up, we'd bid about $11. But if you look carefully, the tool is telling us that we'd be spending more than what we would spend on a reserved instance. So if we're going to save money, we might as well just buy reserved instances. There's no purpose in spending this much money on a spot instance just for the sake of availability.
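The core of a bid-analysis tool like the one described is simple: replay the spot price history and ask what fraction of the time a candidate bid would have kept you running. Real history would come from the EC2 DescribeSpotPriceHistory API; the sample data below is made up for illustration:

```python
# Given a month of spot price history as (hours_at_price, price) segments,
# compute the fraction of hours during which a given bid would have kept
# the instance (bid at or above the spot price means we stay up).

def uptime_fraction(price_history, bid):
    total = sum(hours for hours, _ in price_history)
    up = sum(hours for hours, price in price_history if bid >= price)
    return up / total

# Mostly cheap, with occasional spikes (roughly one month of hours).
history = [(600, 0.35), (100, 0.50), (20, 1.40), (4, 9.00)]

print(uptime_fraction(history, bid=0.70))   # survives the small spikes only
print(uptime_fraction(history, bid=10.00))  # 100% up, but at what cost?
```

The second question the talk raises, whether a 100%-uptime bid still beats the reserved price, is then just a matter of multiplying the hours spent in each price segment by min(price, bid) and comparing the total to the reserved-instance cost.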
So make sure to keep this in mind; you want to keep a balance. You want to save money, but you want to stay up, and you've got lots of knobs to do it. And again, at Yelp, 2x is the amount we've decided is close enough. You will be out-bid. It's just the nature of the beast. If you're going to choose to run on this very unreliable infrastructure, you need to be able to deal with terminations. Even in plain EC2, things get dropped on the floor. EC2 instances die. I remember the first time I received a notification that my EC2 instance was on bad hardware, and I was like, oh, it's a cloud. It's not what I thought it was. Yes, EC2 instances go away, but they go away more frequently in spot fleet requests, of course. How do you deal with it? Well, luckily, in Mesos land, this is very easy. Mesos has all the tools it needs to move compute to different places. Super great. Can we do better than just letting Mesos reschedule things as it sees fit? We can. Let me show you a few examples of what's happened. In this case, Mesos has done a great job of reacting to these out-bid events. We got out-bid on two spot markets at the same time, which represented maybe 20 or 30% of our capacity. The reason I was even able to find this example of us being out-bid is because it was a pretty big event. Luckily, it wasn't so big that it impacted our users, because we're saving so much money on Spot that we can overprovision. We know to expect things like 30% cuts in our capacity. No problem. Let's just overprovision another 30%. We're saving so much cash, it's no big deal. But Mesos also is able to compensate. You can see that only a few minutes after we were out-bid, we were able to relaunch on new instance types, and Mesos made it happen. Pretty great. Here's one where it didn't go so great. At Yelp, we were out-bid. A couple of things compounded here to make this event pretty bad.
You can see that this was before we had really good best practices around weighting, so we were really, really hot on a couple of instance types in a couple of AZs, and when those got out-bid, we lost a lot of capacity. It's about 50% in this graph, so that was pretty significant. Again, luckily, there was no user-facing impact, because we were overprovisioned and it wasn't a super high-impact time. Mesos did what it could, but at the exact same time, somebody broke Puppet. And when they broke Puppet, Puppet couldn't run, Puppet couldn't bootstrap, and we couldn't launch new capacity. Spot Fleet was doing its job. Spot Fleet was launching plenty of instances in their place, but Mesos couldn't launch tasks because our bootstrap was broken. You can see exactly when the bootstrap was fixed: Puppet started taking care of things, and Mesos recovered great. Mesos, again, is the perfect tool for this job, and we were only able to run our production workloads on this kind of unstable infrastructure because of the way Mesos is so good at launching things on new compute where it's available. But you can do a little better by using Mesos maintenance primitives. Unfortunately, these are only really supported in the HTTP API, so most frameworks do not support them. It's kind of a shame. I hope that by next year's MesosCon, maintenance primitives are really well adopted, but at the time of this talk, they just aren't. We use Marathon extensively, and there's a pull request for Marathon to do part of it: rejecting offers from agents that are in maintenance. That's some of it. But we need the other half, where Mesos will take things away from servers that are going down. The way we detect this is that you can curl a very specific URL on every EC2 server that's in a spot fleet, any EC2 server really, but spot instances specifically, and you can see when you're about to be out-bid.
You get two minutes to evict. So what we do is we just poll this URL, and when we notice that we're about to be out-bid, we ask Mesos to go into maintenance mode. This is pretty easy because the endpoint is on the local host, so you can just query the API: when you get out-bid, go into maintenance mode. Unfortunately, none of our frameworks support it, so we use PaaSTA, our platform as a service, to work around this, and we ask Marathon specifically to relocate tasks by downing and killing them, and then it relaunches them. We do our best. It's not perfect, but the tools are in place to gracefully handle this type of termination in Mesos using the primitives. However, the elephant in the room is that two minutes is not very much time, so it maybe goes without saying that we don't run workloads that take longer than two minutes to spin up on Spot Fleet. Specifically, we don't run anything that's stateful. Of course, we don't run Cassandra on this. We don't run Kafka. But luckily, most of Yelp's compute workload is not that, so we don't mind paying reserved prices for those pieces of infrastructure that really need to be up all of the time, and launching this for pretty much everything else. Again, Mesos does a great job of tolerating this. I do have to say that at the time of this presentation, I can't actually recommend using the maintenance primitives for this, because there's a critical bug. I think most people just aren't using these primitives because they're not supported in their frameworks, but we can easily crash the Mesos master by hitting this maintenance API in the wrong way. So we actually don't do this in real life, to be honest. I hope that bug is fixed quickly, and then we can turn this back on to make our termination procedures even more graceful.
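The polling loop just described is simple enough to sketch. On a spot instance, EC2 serves a termination timestamp at a well-known instance metadata URL once you're marked for reclamation (it 404s before that). The drain step at the end is hypothetical glue; the real Mesos maintenance flow posts a JSON schedule to the master's HTTP API:

```python
# Sketch of a spot termination watcher. The metadata endpoint is real;
# the drain hook at the bottom is a placeholder for whatever your
# platform does (maintenance mode, load balancer removal, task draining).

import urllib.request

TERMINATION_URL = ("http://169.254.169.254/latest/meta-data/"
                   "spot/termination-time")

def check_termination(fetch):
    """fetch() returns the endpoint body, or raises on HTTP 404.
    Returns the termination timestamp string, or None if not yet marked."""
    try:
        body = fetch()
    except OSError:          # urllib's HTTPError is a subclass of OSError
        return None
    body = body.strip()
    return body or None

def fetch_metadata():
    with urllib.request.urlopen(TERMINATION_URL, timeout=2) as resp:
        return resp.read().decode()

# Real loop (not run here): poll every few seconds, and on a timestamp,
# start draining this host well within the two-minute window:
#
#   if check_termination(fetch_metadata):
#       drain_this_host()   # hypothetical: maintenance mode, LB removal, etc.
```

Separating the parse logic from the HTTP fetch also makes the watcher easy to test without a real metadata service.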
So let's recap some general advice for those in the room who are really thinking about Spot Fleet and how they might use it at their own company. Diversification is probably the number one best practice to take away from this. Diversify your spot fleets as much as you can to reduce your risk when running production workloads. Locking spot fleets per AZ: I think this is a really good idea, and if you think AZ balancing is important, definitely do that. It's easy to do by launching spot fleets per AZ. A spot market Mesos attribute is kind of cool. It's pretty easy to add arbitrary attributes to Mesos agents, and then you can have your frameworks interpret them in whatever way they can to reduce your risk even further. If you set a spot market attribute like this, you can ask, say, Marathon to group by certain things, so Marathon doesn't accidentally pile up tasks on particular spot markets. And again, respond to maintenance events as best you can. AWS gives you the tooling to curl that endpoint; do whatever you can with that. At Yelp, that means taking the instance out of the load balancer, talking to Marathon to reschedule some stuff, whatever you can do. Profit. Is it worth it to Yelp to build all of this tooling, tolerate all of this instability, and pay for all the engineering required to keep this up and running? Well, I ran the numbers. I wanted to show my work. It's okay that you can't read all of the data here, but I promise that the answer is absolutely yes. There are certain times where we are paying more than the reserved price. That's okay. On the whole, Spot Fleet is saving us tons and tons of cash. Here's an example where there's a sustained period where we're paying more than the on-demand price, but again, it's okay. On the whole, we're still saving cash.
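The spot market attribute mentioned in the recap is just a string tag on each agent identifying the market (instance type plus AZ) it lives in, so a framework can spread tasks across markets. The attribute name and format here are our own convention, not anything Mesos defines:

```python
# Build the "spot market" attribute for a Mesos agent: one string that
# uniquely identifies the (instance type, AZ) market the agent lives in.
# The attribute name "spot_market" is a site-local convention.

def spot_market_attribute(instance_type, availability_zone):
    return "spot_market:%s-%s" % (instance_type, availability_zone)

# The string would be passed to the agent at startup, e.g.:
#   mesos-agent --attributes="spot_market:c4.4xlarge-us-west-2a" ...
# A framework like Marathon can then use a GROUP_BY-style constraint on
# this attribute so tasks don't pile up in any single spot market.
print(spot_market_attribute("c4.4xlarge", "us-west-2a"))
```

Because a whole market disappears at once when you're out-bid, grouping by this attribute bounds how much of any one service a single out-bid event can take down.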
And I wanted to make sure that this was actually true across all the instance types and across all of our regions, because there could have been a case where one was running really hot and we were wasting money by launching a particular instance type. I'm happy to say the answer is no. When we look across the board, at all the regions and all the instance types we're using, and then do a weighted average based on the hours we're spending on each, I can say with confidence that we're saving 50% over the reserved price. Here's another way to visualize that data. You can see some hotspots that come up to the reserved price, not on-demand, but on the whole, we're saving 50% compared to our baseline reserved instance type, which is a three-year convertible. This is the type of reserved instance we would be paying for at Yelp if we weren't using Spot. So this is the comparison that we have to make as a business: what would we be doing otherwise? It's not really fair to compare the Spot prices to on-demand for the business, because that's not what we'd actually be paying. But compared to those prices, we're still saving 50%. To put it another way, if we were to snap our fingers and stop using Spot Fleet, we'd be paying twice as much for our infrastructure. This is a huge deal. How huge it is really is a function of how big your clusters are, how many engineers you have, and how big your AWS bill is. But if it's 2x, that's a big deal. So it is a lot of profit. I would like to shout out the early adopters. These are the Yelp engineers who adopted Spot early in our development environments to save a lot of cash there. It wasn't until we learned a lot from them that we were able to move it to prod. I'm not an island here. I represent a team, and these are the team members who work with me to help make sure that this spot fleet infrastructure stays up as well as it does.
If you're thinking about going into Spot Fleet, this is a must-read list for you; look at the slides on the MesosCon website afterwards to see these. There are alternatives to Spot Fleet that give you some of the same benefits, including some hosted solutions, so before you get into this, you need to be well-educated. There are also some academic papers studying these markets: are they sustainable, can they work in the long term, that sort of thing. I highly encourage you to look at these if you're actually going to go implement Spot. Learn as much as you can about how it actually works. And that's it. If you go to our GitHub, you can see PaaSTA, our Mesos platform-as-a-service, which has all of the code I referenced in this talk if you're interested in the actual implementation. And of course, Yelp Engineering, hiring, the blog, and all that stuff. You can contact me personally at that email. We have a few minutes for questions. Microphone, would you like to? Or I can repeat the question. Sure. The question was: how much did Yelp have to spend in order to build up the infrastructure required to do this? That's really important for doing the cost-benefit analysis. Hard to say. As you can see, it's probably the four engineers from dev: four engineers working for a couple of quarters, not all full-time. But at our scale, it was definitely worth it to do. I don't think we actually did a formal cost-benefit analysis, but looking at our AWS bill, it's obvious that it's worth it. Luckily, you don't have to spend all this engineering effort yourself. The Terraform provider we open-sourced; we just made pull requests back to Terraform. The PaaSTA infrastructure is all open-source, and our Spot-aware auto-scaler is open-source as well.
So luckily, you can learn from all the work we did, spending hours looking at graphs and figuring out the best practices for using this tool, by, of course, watching this talk. Maybe one more question? Okay, actually I have about seven questions. Do you change the capacity of your fleets by hand, or do you have some automated tool that tracks the usage of your Mesos fleet and does it automatically, at night, for example? Sorry, you're asking why don't we... Okay: do you apply the Terraform state by hand, or is it done by some automated tool? You mean the overall scale of the Spot Fleet? Yes, scaling up and down of your Spot Fleets. Yes, we have an auto-scaler on the Spot Fleet. With Spot Fleet, just like an auto-scaling group, you can tell it how big you want it to be, and if you look in the PaaSTA code base, we have an auto-scaler that inspects the Mesos statistics and expands and contracts the fleet at will. Because we already had the maintenance-event handling, we didn't really care whether a box was going away because our auto-scaler decided to scale down or because Amazon decided we got outbid; it's the same kind of event, so we were lucky to reuse that work. Whether we're scaling our cluster in or we're getting outbid, it's no big deal. Our auto-scaler has some special sauce, though: we pick the instance that has the fewest tasks on it. We have a sorting function that prefers to kill instances that aren't in Mesos in the first place because they failed to bootstrap, or that don't have any tasks, and we kill those off early so we preserve the instances that have long-running tasks; we just do the best we can. Okay: do you exclude the most unstable instance types from your fleets? Do you do some survey, for example? Yeah, we try to.
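The scale-in preference described in that answer (kill failed-bootstrap and idle agents first, keep agents with long-running tasks alive as long as possible) can be sketched as a sort key. The `Agent` shape here is an assumption for illustration, not PaaSTA's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    instance_id: str
    registered_in_mesos: bool  # False if it failed to bootstrap
    task_count: int

def kill_priority(agent):
    # False sorts before True, so unregistered agents go first;
    # among registered agents, the ones with the fewest tasks go next.
    return (agent.registered_in_mesos, agent.task_count)

def pick_agents_to_kill(agents, n):
    """Choose the n agents to terminate when scaling the fleet in."""
    return sorted(agents, key=kill_priority)[:n]

agents = [
    Agent("i-aaa", True, 12),   # busy: preserved as long as possible
    Agent("i-bbb", False, 0),   # failed bootstrap: first to go
    Agent("i-ccc", True, 0),    # idle: next to go
]
victims = pick_agents_to_kill(agents, 2)
print([a.instance_id for a in victims])  # → ['i-bbb', 'i-ccc']
```

The same ordering works whether the trigger is a deliberate scale-down or a spot eviction you can see coming, which is what lets one drain path serve both events.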
We try to adjust our Spot Fleet request to be able to run the largest thing we need to run. Our largest application happens to be our main website, so the smallest instance type we find acceptable is the kind that can run the website. But everyone has to make their own choices about which instance types are acceptable to run on their infrastructure. Okay: what is the average real load of your machines compared to the reserved resources on Mesos? I'm sorry. What is the average real load of your machines in Spot Fleets? I'm sorry, I'm having trouble understanding over the accent, and we're out of time. Let's talk offline and we can discuss more. Thank you for your time.