I work as a production engineer at Mapbox, where I build systems that scale and, in a small way, help lay the technical foundation on which all of Mapbox's services run. Today I'm going to talk about SpotSwap, which is a module we built in-house; it's also a very interesting architectural concept. We use AWS across all our services, so this talk may be a little heavy on AWS jargon, but I've tried to break it down wherever possible.

We run a host of services at Mapbox, and one of these is our mapping service, API Maps. On any given day this service serves about 1.5 million requests per minute across six origin regions, and the servers that handle these requests are housed in autoscaling groups, which is an AWS concept. It's hard to get to the essence of my talk without giving you an abstraction of our architecture, so you can think of it as being composed of three parts: everything that happens from the time we receive a request until it reaches a load balancer; the load balancer (denoted by an ALB in this diagram), which routes the request to a server in the autoscaling group; and the server itself, which retrieves whatever ingredients it needs to fulfill that particular request.

But what is an autoscaling group? You can think of it as a group of servers that, as the name suggests, can scale up and down in capacity, and you configure that scaling with three parameters: the minimum, the desired, and the maximum. The minimum determines the least number of instances your autoscaling group can have, that is, the fewest instances you can scale down to; the desired is the optimal number of instances you want during periods of normalcy; and the maximum is the largest number of instances you can scale up to. (There's a small sketch of these parameters below.) The other point about autoscaling groups is that their servers are very similar in nature: each server in the group has the same amount of CPU and the same amount of memory.

Coming back to our architecture, the focus of my talk is the central portion, the autoscaling group, and how we ensured it was reliable: it always had to have enough servers to handle the traffic we receive and to keep our latency really low (currently about 200 milliseconds). And secondly, of course, there was the constraint of cost. We're a startup, so we wanted to work with the several pricing strategies Amazon offers to make sure we had the possibility of scale while keeping our costs really low.
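To make the three scaling parameters concrete, here is a minimal sketch of creating such a group with boto3. This is an illustration, not our production setup; the group name, launch configuration, subnets, and numbers are all placeholders.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# A group that never drops below 4 servers, normally runs 16, and can
# scale up to 32. Every instance comes from the same launch
# configuration, so they all have identical CPU and memory.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="api-maps-asg",            # hypothetical name
    LaunchConfigurationName="api-maps-launch-config",
    MinSize=4,
    DesiredCapacity=16,
    MaxSize=32,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # placeholders
)
```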
If you work with Amazon, it lets you rent server space in the form of EC2 instances, and EC2 instances are categorized by the resources they have, memory and CPU: you can have instances with a lot of CPU and little memory, or various other combinations of memory and CPU. But it's also interesting how you pay for that server space. A useful way to think about the pricing strategies Amazon offers is along two axes: the stability of the instance you rent, and the amount of time you want to use it for. On that basis, instances fall into three types.

First, you have the on-demand instance, which is very expensive and paid for on a pay-as-you-go model: the minimum usage is one hour, and you pay for every additional hour of usage. These instances are very stable; you have to explicitly shut them down for them to disappear. Second, you have reserved instances, which are in the medium range of cost, about 50 to 70 percent of the cost of an on-demand instance. Again, these are very stable; you pay for them by reserving them for a certain number of hours in advance, so you run the risk of reserving an instance for longer than you need it around. And then, luckily, you have the spot instance. Spot instances come from Amazon's unused server space and are ridiculously cheap; in fact, you can see about 90 percent savings compared with what you would spend on on-demand instances. But these instances are very unstable, and their pricing is governed by the spot market, which I'll come to next.

The Amazon spot market is a very interesting place, and it works on strategies similar to gambling. You set a bid price for an instance type, depending on how much CPU and memory you want, and if your bid price is higher than the market price, your request is fulfilled. The market price, of course, is constantly in flux, because it changes as the demand for a given instance type changes. When demand increases to the point where the market price rises above your bid price, your instance is terminated with a two-minute warning; that's all you get, and then your instance is shut down. And when the spot market stabilizes and your bid price is once again higher than the market price, your instance comes back to life. (There's a sketch of a spot bid below.)

It was winter 2015, and at Mapbox our traffic had reached a point where we definitely had to scale. The obvious answer would be to scale linearly, but that would also mean a linear increase in cost, so we wanted to mitigate it by designing a system that would be highly available but could still run on spot instances. That is what SpotSwap is. Before going into the details of SpotSwap, though, I want to talk about two architectural features at Mapbox that made it possible. The first is that our systems are very fault tolerant: all our persistent data is stored in hosted, resilient, scalable databases like Amazon S3 and DynamoDB, and each EC2 accesses this persistent data remotely, so the EC2 itself stores no persistent data. And secondly, our EC2s have very fast boot times.
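To make the bidding mechanics concrete, here is a minimal sketch using boto3's classic spot request API. The bid price, AMI, and instance type are placeholders, not what we actually ran.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Bid $0.05/hour for one c4.large. The request is fulfilled while the
# spot market price stays below the bid; if demand pushes the market
# price above it, the instance is reclaimed with a two-minute warning.
response = ec2.request_spot_instances(
    SpotPrice="0.05",               # our maximum bid, in USD per hour
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-12345678",  # placeholder AMI
        "InstanceType": "c4.large",
    },
)
print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])
```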
It takes an EC2 under two minutes to become healthy and able to serve traffic from the time it comes back to life.

Given the potential for turbulence in the spot market, you would think you should use it only for staging services or development environments, but we managed to mitigate the risk around critical availability by running a highly available system on the spot market, and we call it SpotSwap. The SpotSwap ecosystem consists of two autoscaling groups: a spot autoscaling group that is always scaled up to full capacity, and an on-demand autoscaling group that usually has no instances but, when it comes to life, is filled with on-demand instances that stay around for as long as you want them to.

So how does SpotSwap work? In essence, our strategy was to counter the turbulence of the spot market with the on-demand autoscaling group. Every spot instance is configured with an upstart script that constantly polls a metadata endpoint, trying to find out whether it is at risk of termination. It's important to note that this information is available only to the instance itself; that is how Amazon provides it, and there is no external way to find out whether an instance is at risk of being shut down. When an instance realizes it is going to be shut down in the next two minutes, it immediately requests a replacement for itself in the on-demand autoscaling group, which it does by incrementing that group's desired configuration parameter by one. (There's a sketch of this per-instance script below.)

As I mentioned, there is no external way to figure out whether an instance is at risk of being shut down, and the first bottleneck we encountered was race conditions. Here is how: I told you that an autoscaling group is composed of instances that are similar in nature, with the same amount of CPU and memory. So whenever there is a spot price-out, that is, your bid price falls below the market price, several instances are at risk of termination, not just one. When several instances realize at the same time that they will be terminated in the next two minutes, they race, and the on-demand group gets incremented only once. In one such case, two instances realized they were at risk of termination, both hit the on-demand API, both found there was one instance, and both set the desired capacity to two. So you end up in a scenario where three instances have disappeared from the spot group but your on-demand capacity has not come up to the same extent.

Our first strategy to counter this was to aggressively scale up the on-demand autoscaling group any time we encountered a spot termination. If even one instance was at risk of termination, we would scale the on-demand group up to full capacity; similarly, when several instances found they were at risk, they would simply set the desired capacity to 16, the on-demand group would come up to full capacity, and your traffic would still have enough servers to serve responses. But obviously this is not optimal.
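Here is a rough sketch of what such a per-instance watcher can look like, assuming the standard EC2 instance metadata endpoint for spot termination notices; the group name is hypothetical, and the read-then-write increment at the end is exactly the step that races when several instances are priced out at once.

```python
import time
import boto3
import requests

# The spot termination notice is only visible from inside the instance:
# this endpoint returns 404 normally, 200 once termination is scheduled.
TERMINATION_URL = ("http://169.254.169.254/latest/meta-data/"
                   "spot/termination-time")
ONDEMAND_GROUP = "api-maps-ondemand"  # hypothetical group name

autoscaling = boto3.client("autoscaling")

def request_replacement():
    """Ask the on-demand group for one replacement server (naive version)."""
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ONDEMAND_GROUP]
    )["AutoScalingGroups"][0]
    # Read-then-write: two instances doing this concurrently both read the
    # same DesiredCapacity and both write the same new value -- the race.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ONDEMAND_GROUP,
        DesiredCapacity=group["DesiredCapacity"] + 1,
    )

while True:
    if requests.get(TERMINATION_URL, timeout=2).status_code == 200:
        request_replacement()  # we have at most two minutes left
        break
    time.sleep(5)
```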
So we realized there were two obstacles: first, we needed an external means of assessing the termination state of an EC2, and second, we needed a resource that could assess, in a centralized manner, the state of both the spot autoscaling group and the on-demand autoscaling group. For the former, we decided to use Amazon's tagging feature: you can tag an EC2 with metadata, and that metadata is accessible to external resources, so any EC2 at risk of being terminated would tag itself. For the latter, the centralized means of assessing the state of both groups, we went with AWS Lambda. Lambda is a stateless resource that you can configure to run at periodic intervals and assign whatever task you want done. In our case, the external Lambda kept track of the number of EC2s in the spot autoscaling group that were tagged as being at risk of termination, and then sent a single API call to the on-demand autoscaling group to increment it. In this manner we were able to scale the on-demand group optimally whenever the spot group was at risk of losing capacity. (There's a sketch of such a Lambda below.)

But I told you that we mostly run our services on spot instances, so how does the service go back to its previous state, serving traffic almost entirely on spot? At this point the spot and on-demand autoscaling groups are both serving traffic, but eventually the spot market stabilizes, and the instances you lost slowly start coming back to life. Your system is now overprovisioned: you have more capacity than you need, so CPU utilization on the on-demand group slowly starts to drop. We configured an autoscaling policy that aggressively scales down the on-demand group once utilization falls below a certain threshold. (It's funny, because at this point the diagram looks like a sad emoji; they all realize they're going to die.) When that threshold is hit, the group scales down very aggressively, and the system goes back to running entirely on spot instances.

We have seen it take minutes, sometimes even hours, for the spot market to stabilize, but even those brief periods where we run entirely on on-demand instances to weather the storm have not really cut into our savings, which have been up to 50 to 80 percent of cost on our API architecture and on our caching architecture, which we have also migrated to SpotSwap.

Obviously this system has room for improvement, and one obvious roadblock is that a spot price-out affects more than one instance, because we depend on autoscaling groups of identical instances. One improvement we made was to use Amazon Spot Fleet. A spot fleet comes with a set of diverse instance types, so you can have instances with different memory and CPU resources in the same fleet, and a spot price-out then affects only a small part of your fleet instead of most of it. And secondly, since we migrated to Docker on ECS, we have noticed that our boot times have become even faster, because now we are deploying containers.
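Here is a minimal sketch of such a scheduled Lambda, under two assumptions of mine: that doomed instances tag themselves with a key like `spotswap:termination-notice`, and that the handler runs every minute. The tag key and group name are illustrative, not necessarily what SpotSwap uses.

```python
import boto3

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

ONDEMAND_GROUP = "api-maps-ondemand"  # hypothetical group name

def handler(event, context):
    # Count running instances that tagged themselves as doomed after the
    # metadata endpoint announced their termination.
    doomed = ec2.describe_instances(
        Filters=[
            {"Name": "tag-key", "Values": ["spotswap:termination-notice"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    count = sum(len(r["Instances"]) for r in doomed["Reservations"])
    if count == 0:
        return

    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ONDEMAND_GROUP]
    )["AutoScalingGroups"][0]
    # A single writer making a single API call: only the Lambda touches
    # DesiredCapacity, so the per-instance race condition disappears.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ONDEMAND_GROUP,
        DesiredCapacity=min(group["DesiredCapacity"] + count,
                            group["MaxSize"]),
    )
```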
And that brings me to the end of my talk, where I want to tell you that SpotSwap is now open source. I hope all of you can use it and keep the cost benefits it offers. I'm happy to take any questions now. Thank you.

Audience: Spot instances exist to soak up idle capacity, and what you're really trying to do is use them in production. If everyone starts using SpotSwap, there will be a lot more demand for spot instances, and then the whole point of spot instances goes away, right?

I mean, it's an architectural concept, right? If that happens, we'll have to figure out some other way around it, I guess.

Audience: The other question I have is: can you use spot instances along with ECS? I didn't know you could do that.

Yes, you can use spot instances with ECS, and you can use autoscaling groups with ECS. Our current model is a mix of a spot fleet and an on-demand autoscaling group.

Audience: And can you do autoscaling on the spot instances as well?

Yes, we scale our cluster. Oh, and I was just thinking, a good answer to your first question could be that not everybody will use the same instance types, so maybe we're still protected by the diversity of instance types that Amazon offers. Maybe that's a good reason why it won't affect everyone.

Audience: When this migration happens from spot instances to on-demand instances, did you experience any latency in the application?

Not really, because firstly, the spot group is overprovisioned in the first place, so we have a lot of capacity there, and as soon as the spot group starts going down, the on-demand group comes to life. And as I was saying, our EC2s are healthy and able to serve traffic in under two minutes, so it doesn't really affect our latency very much.

Audience: What is the average time, in terms of hours, that a spot instance stays available once you've provisioned it in the real world?

I don't have a good sense of the numbers for the instance types that we use. It totally depends on market demand, and it fluctuates wildly, so you can't really tell. I can get back to you with numbers, but I don't know them off the top of my head. I've had instances that lasted for weeks, and I've also had instances that died in the first hour, so it's really hard to tell. It totally depends on the market demand at that point in time, and it also depends on which region you request your spot instance from. In Amazon us-east-1 in Virginia, demand is really high because so many people have services running there, so a spot instance in Virginia is more likely to die than one in a region with less demand, like Tokyo. It also depends on time zones: people are going to access your services during the day in certain places, and at those times there will be a lot of demand for instances there.
Audience: So it depends on a range of factors, then. Is there a technique you can use to still get high availability?

One technique we use is to set our bid price to the same value as the on-demand price. Okay, wait, let me explain this properly: we set a bid price equal to the current on-demand price, and we expect the spot market price to stay below that value, and whatever the market price is, that is what you're actually charged for your instance. You could even bid 50 dollars, for example; I mean, you shouldn't, but if you bid 50 dollars and the market price is, say, 50 cents, that is what you pay. So experimenting with the bid price is a good technique, I guess.

Audience: You have two autoscaling groups, say 16 instances in one and zero in the other, right? Now say it goes to 16 in one and 4 in the other, but I don't know whether 16 is the right number; maybe I want that to scale too. How do you manage that the sum of the two scales up and down based on, you know, overall latency or overall CPU usage? Do you do that already?

The thing is, we don't want to run on on-demand; our strategy is to run entirely on spot. And 16 is just an arbitrary number here; it's not that we run on 16 servers.

Audience: But it would be some fixed number; how do you dynamically change it? Say your cluster has 16 instances, of which four spot instances are dead, so you're running four of them on on-demand. How does the autoscaling group know it should scale down from 16 to, say, 12 right now because it's evening hours and traffic has gone down?

Good question. One thing to note is that the spot group never scales down: it always has the same capacity, and we don't have an autoscaling policy on it, so it always has a set number of servers. We've noticed that even with that overprovisioning we save a lot of money, so we've kept it that way.

Audience: You partly answered my question already, but for the bid price, can you have a dynamic bid price? Could you set it low and raise it when spot capacity is going away, before falling back to on-demand instances?

You can't quite do that, because by the time you know, the market price has already gone above your bid price. It's very hard to estimate when you're at risk of losing capacity, so it's not something we have, but maybe it's possible. Currently SpotSwap does not facilitate that kind of scenario, raising the bid price before going to on-demand; we haven't implemented it yet.

Audience: We use spot instances and spot fleets extensively, so my question is, how do you take care of deployment? Our current setup is that we reboot the instance and the upstart script pulls the latest artifact and everything, but if we have more than 20 or 30 boxes, it takes time. What would be a good strategy for that?
Okay, so we actually use Spot Fleet on our cluster. Back in the day when we used autoscaling groups in the cluster, we didn't use Spot Fleet. What we do now is let the user select five different types of instances, and they can further configure a weight to be given to each instance type. We determine this weight by memory, so every 100 MB of memory is one unit in our case, and we also give a bid price for each instance type. Then we have a maximum capacity that we want the fleet to have, the maximum weight, and depending on the weights we've configured, when we create the cluster it automatically provisions it with a choice of instances. I think there are two ways to choose the kinds of instances you want: maximum diversity, or lowest price. We've chosen maximum diversity, so we want the instance selection to be as diverse as possible; we give a weight for each instance type and an overall weight, and the fleet provisions the instances automatically. (There's a sketch of such a request below.)
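As a rough illustration of that weighting scheme, here is what a weighted, diversified spot fleet request can look like with boto3. The instance types, AMI, prices, IAM role, and weights are placeholders; the weights just follow the one-unit-per-100-MB convention mentioned above.

```python
import boto3

ec2 = boto3.client("ec2")

# One weight unit per 100 MB of memory (the convention from the talk):
# a 3.75 GB c4.large counts ~37 units, a 7.5 GB m4.large ~75 units.
launch_specs = [
    {"ImageId": "ami-12345678", "InstanceType": "c4.large",
     "WeightedCapacity": 37.0, "SpotPrice": "0.10"},
    {"ImageId": "ami-12345678", "InstanceType": "m4.large",
     "WeightedCapacity": 75.0, "SpotPrice": "0.12"},
    # ...up to five instance types, each with its own weight and bid
]

ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "IamFleetRole": "arn:aws:iam::123456789012:role/fleet-role",
        "AllocationStrategy": "diversified",  # maximum diversity
        "TargetCapacity": 600,                # overall weight for the fleet
        "LaunchSpecifications": launch_specs,
    }
)
```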
Audience: This is Shubham. When you showed the spot autoscaling group and the on-demand autoscaling group: since instances are paid for by the hour, can there be something like thrashing, where you spin up an on-demand server and your spot instance comes back before an hour has passed? You'd move back again and end up paying for both for that hour. How do you tackle that problem?

We haven't really tried to. For us, the cost savings were so large even with this system that we didn't try to iterate further, but it's open source now, so I'm happy to receive feature requests and pull requests from you. We haven't implemented that.

Audience: That was a wonderful talk. One thing I want to know: you said a spot fleet has different types of instances, so how do you match that capacity on the on-demand side? Is there some policy that checks which instance type has gone down, so that you bring in the right on-demand one?

Good question. One thing I didn't mention in that answer is that we also have a weight and a capacity for the on-demand group: you can set a weight for each on-demand instance type and an overall weight that you want for the on-demand group. This is something we configure per cluster in our stacks.

Audience: My question is around deployments. When you have these autoscaling groups, on-demand and spot ones, how do you handle your deployments?

At Mapbox we use CloudFormation for everything; we don't use the Amazon console at all. We also have an open source tool called cfn-config, which you can all look up, and we use that for deployments. It gives you a set of parameters you can configure, and then the deployment is done based on your CloudFormation template.

Audience: My question was more around application deployment. Once you rebuild your code base, or have a new artifact to be deployed, how do you deploy it? How do you upgrade your instances, or the software on them, given that they might terminate at any time? Do you upgrade the image itself, the image for the AWS instance? How do you handle your weekly deployments, or whatever deployments you do?

So SpotSwap is currently part of our cluster, and the deployment of our services is independent of it. The cluster exists as a separate stack, and SpotSwap is part of that stack. When you deploy a service, it essentially just sits on one of these containers; it isn't worrying about provisioning, or about whether it has enough servers, because the cluster takes care of that. Once the cluster is deployed, the fluctuations of the market determine the number of servers you have, whereas your service deployments do not directly impact this. Does that answer the question, to an extent? I can also talk to you after the talk. All right, thank you, everyone.