Hey everyone, welcome to the talk. My name is Sagar Patwardhan. I'm a software engineer at Yelp; I'm part of the distributed systems team, and I've been working at Yelp for the last two and a half years. Today I'm going to talk about Seagull, a distributed, fault-tolerant, concurrent task runner that we built mainly to run our tests in parallel.

Let's talk a little bit about Yelp first. Yelp is a mobile app and a website that connects people with great local businesses. It's basically a search engine where you can find whatever you're looking for: a restaurant, a dentist, maybe a doctor. To give you a sense of how big we are, we have 28 million monthly mobile app users — these are the monthly active unique visitors we get — plus 74 million mobile web users and 83 million desktop web users. So we are pretty big.

In today's talk I'm going to cover what Seagull is and why we built it. I'm also going to give a comprehensive overview of how Seagull works and then deep dive into it. I'll talk a little bit about the cluster autoscaler that we built at Yelp, then about challenges and lessons learned, and finally about the future of Seagull.

So let's get started. To give you a little overview of testing at Yelp: we have a hundred thousand tests that need to run before our developers can push code to prod. We have a monolithic application with a lot of tests, and we also have a service-oriented architecture with over 150 services. If you executed these tests serially, it would take two days to complete them all, and we can't waste that much time on testing. We have north of 500 developers, and testing has a direct impact on developer productivity: the more time we spend on testing, the less productive our developers are.

So, Seagull to the rescue. As I mentioned before, it's a distributed, fault-tolerant system which runs tasks in parallel. To give you a sense of how big Seagull is: we usually have 350 Seagull runs every day, and the average runtime for each run varies between 10 and 15 minutes depending on what kind of tests we are running. We spin up 2.5 million ephemeral Docker containers just for the testing use case. These containers are basically different services at Yelp — developers want to test their service against other services, so we spin up a lot of containers. We use a cluster autoscaler, so our cluster is always varying in size: at night, when the load on our cluster is at its minimum, we have 70 instances, and the cluster scales up to 450 instances during peak time, which is afternoon US time. We have also moved to spot instances: initially we started off with RIs and on-demand instances, and now we are just using spot instances. Every day, Seagull executes 25 million tests, basically to ensure that developers are pushing code that is okay to push to prod.

Let's talk about applications of Seagull before diving into Seagull itself. It was mainly written for the testing use case, to run tests in parallel, but we have extended it to other use cases. We have a load-testing framework called Locust, which is mainly used for testing different endpoints of services. Let's say you are a service owner and you want to see how many requests per second a certain endpoint of your service can sustain: you would use this framework to make parallel calls to your service and see if it holds up under the load.
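To make that concrete, a minimal Locust file looks something like the sketch below. This is illustrative only — it uses Locust's modern public API, and the endpoint and class name are made up, not one of our actual load tests:

```python
from locust import HttpUser, task, between

class SearchUser(HttpUser):
    # each simulated user waits 100-500 ms between requests
    wait_time = between(0.1, 0.5)

    @task
    def hit_search(self):
        # /v1/search is a hypothetical endpoint, not a real Yelp API
        self.client.get("/v1/search?q=coffee")
```

You then run many of these simulated users in parallel and watch how the endpoint holds up.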
The next use case is photo classification. Yelp has tens of millions of high-quality photos, and we use deep learning to figure out whether a photo was taken inside a restaurant or outside it, whether it contains food, whether it contains a drink, and so on. We train a model and then use that model to classify photos on Seagull. Initially we ran this on Spark, but we have since migrated it to Seagull, and we run classifiers over close to 90 million photos in less than a day.

So let's talk about Seagull itself. I'm going to walk you through a testing workflow, and then I'll go into the specifics of each section. Let's say you are a developer at Yelp: you have created your branch, you have committed your code, and you want to run tests. The first thing you do is use a command-line tool called run-seagull, specifying which branch you want to test and which tests you want to run. After starting a Seagull run, we start a build on Jenkins. We do that mainly because certain things need to happen before we can start executing tests on your branch. One of the first things is creating an artifact: we pull the developer's git branch, create a tarball, do a bunch of things, and upload it to S3. This is later used by the executors to run tests.

The next thing we do is figure out how many tests there are in your branch, so we run a job called test discovery, which goes ahead and figures that out. Then, since we run tests in parallel, we need to split these tests into smaller chunks — or bundles, as we call them. We run a program called the bundle creator, which divides the tests into multiple bundles.

After doing that, we start the Seagull Mesos scheduler, which talks to the Mesos master and gets offers from different agents in our cluster. After getting the offers, we start sending executors to the agents. The first thing the agents do is download the artifact that we uploaded in the previous stage; they untar the file, compile .pyc files, and start executing tests. After all the tests are done, the results are reported to Elasticsearch and Kibana, which is our main data store for test results.

One thing the Seagull executor does, apart from reporting results, is upload the stderr and stdout of each executor. This is mainly used for debugging: we redirect stdout and stderr to a specific file called the Seagull log, and we upload it to S3 at the end of each Seagull run. We also report real-time metrics to SignalFX and DynamoDB from our scheduler. And we have a UI service called TestResults, which developers use to see the results of their tests — whether all the tests in their branch are passing or not. So that's a complete overview of Seagull. This is specifically the testing use case; depending on the use case, the workflow will vary.

Now I'll talk about the specifics of Seagull. First, the Seagull Mesos scheduler. The scheduler itself is written in Python. We are currently using libmesos to talk to Mesos; we have not migrated to the HTTP API yet. We have one scheduler per test suite. As I mentioned before, there are different test suites at Yelp — for example unit tests, acceptance tests, integration tests, Selenium tests, et cetera — and each test suite corresponds to a Mesos scheduler. At peak time, we have more than 50 schedulers running.
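To give you a feel for the shape of such a scheduler, here is a bare-bones sketch against the classic libmesos Python bindings. This is not Yelp's actual code: the framework name, the pack_bundles_into and retry_or_split helpers, and the ZooKeeper address are all invented for illustration:

```python
from mesos.interface import Scheduler, mesos_pb2
from mesos.native import MesosSchedulerDriver

class SuiteScheduler(Scheduler):
    """One of these runs per test suite; `bundles` is the suite's work."""

    def __init__(self, bundles):
        self.pending = list(bundles)

    def resourceOffers(self, driver, offers):
        for offer in offers:
            if not self.pending:
                driver.declineOffer(offer.id)
                continue
            # hypothetical helper: pack as many bundles as the offer's
            # CPUs and memory allow, returning mesos_pb2.TaskInfo objects
            tasks = self.pack_bundles_into(offer)
            driver.launchTasks(offer.id, tasks)

    def statusUpdate(self, driver, update):
        if update.state in (mesos_pb2.TASK_LOST, mesos_pb2.TASK_FAILED):
            # hypothetical helper: re-enqueue the bundle for another agent
            self.retry_or_split(update.task_id.value)

framework = mesos_pb2.FrameworkInfo(user="seagull", name="seagull-unit-tests")
driver = MesosSchedulerDriver(SuiteScheduler([]), framework,
                              "zk://zookeeper:2181/mesos")
driver.run()
```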
We rely on Mesos' DRF algorithm to distribute offers to us, and so far it has worked really well: we have never run into an issue where one scheduler is starving for offers while another is getting a lot of them. Seagull also has customizable concurrency. Because it is meant to run batches, you can either run your batch really quickly by setting the concurrency to a very high number, or reduce the concurrency and let your batch run for a long time. The Seagull scheduler is also fault tolerant: it can deal with agent failures and bundle failures, and it retries bundles — I'll talk more about the retry strategy in a moment.

Let's talk about placement strategies. The aim here is to optimize Seagull bundle setup time. As I mentioned before, each executor has to download a tarball from S3; the tarball is typically a few GBs, and we also have to do some work before running tests, like compiling Python files. So to optimize bundle setup time, we use two strategies. One is agent affinity: if we have used a particular Mesos agent before, we'll try to use that agent again, so that the executor tarball is already available for the new executor being scheduled. The other is that we use as many resources in an offer as possible: if we get an offer with, say, 30 CPUs and 200 GB of memory, we'll try to fit as many Seagull executors into that offer as possible. Seagull is meant to be a batch runner, so it's really time flexible — there's no hard upper bound on when a Seagull run should finish, and it's okay if a run gets delayed by a minute. If an agent goes down and we lose all the executors on it, we reschedule them and they finish a few minutes later. This also simplifies scale-down: because we use up all the resources that are offered to us, we are always left with a few idle agents in the cluster that can be terminated without causing any disruption.

This UI here is what we call the bundle visualizer. The y-axis is the time scale and the x-axis is bundles. The green bundles are the ones that finished successfully; the red ones are the ones that failed. When a bundle fails, we split it into two bundles: if a bundle held 10 minutes' worth of tests, we split it into two bundles of five minutes each. This is to make sure we finish within a reasonable time — we don't want to schedule the whole bundle again and spend another 10 minutes on it.

Next, the Seagull executor. We have written a custom executor in Python. It currently uses the Mesos containerizer and several Mesos isolators. The main job of this executor is setup and teardown: setup, as I mentioned, is getting the tarball, and teardown is reporting a bunch of stats. Apart from setup and teardown, it also reports the utilization of our process: we run a thread in the executor which constantly collects CPU metrics, memory metrics, the number of network connections in use, disk usage, et cetera. We use that data to decide how many resources to allocate to our bundles. The executor also uploads log files to S3 and reports test results.

We also have special constraints for running these tests. We have some resources we call cluster-wide resources; these are not tied to a particular agent. Typical examples are Selenium connections and database connections. We want to allocate them without tying them to specific agents, and to achieve this we use ZooKeeper ephemeral znodes. There are two znodes in our ZooKeeper cluster: one keeps track of how many connections you may use, and the other acts as a parent node for ephemeral child znodes. So let's say you are an executor and you want to use a Selenium connection. You first check the limit on the number of connections — say, for Selenium. Then you query get-children on the parent znode and see how many ephemeral znodes already exist. If that count is less than the limit, you create a znode and use a Selenium connection. The same goes for database connections. We use ZooKeeper locks to make sure that access is atomic and we don't run into race conditions. As you may know, once an executor goes away, its ephemeral znodes are removed, and the resource is reclaimed at that point.
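Sketched in Python with the kazoo ZooKeeper client, the acquisition path looks roughly like this. To be clear, this is a hypothetical reconstruction of the scheme just described — the znode paths and names are my illustration, not our actual code:

```python
from kazoo.client import KazooClient

def try_acquire_slot(zk, resource, executor_id):
    """Claim one cluster-wide resource (e.g. a Selenium connection)."""
    with zk.Lock("/seagull/%s/lock" % resource):       # make the check atomic
        limit = int(zk.get("/seagull/%s/limit" % resource)[0])
        holders = zk.get_children("/seagull/%s/holders" % resource)
        if len(holders) >= limit:
            return False                                # nothing free right now
        # ephemeral: vanishes automatically if this executor dies,
        # which is how the resource gets reclaimed
        zk.create("/seagull/%s/holders/%s" % (resource, executor_id),
                  ephemeral=True)
        return True

zk = KazooClient(hosts="zookeeper:2181")
zk.start()
if try_acquire_slot(zk, "selenium", "executor-42"):
    pass  # run the Selenium tests, then delete the znode on clean exit
```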
Let's talk about monitoring and alerting. We have real-time monitoring and alerting through SignalFX: we send it time-series data, it visualizes the data for us, and it also provides alerting. The graph you see here shows the bundles over the last day: green bars correspond to finished bundles, red bars are bundles that failed, blue bars are bundles that were killed, and yellow bars are bundles that were lost. For each bundle in Seagull, we specify a runtime — for the test use case, for example, we set it to 30 minutes. If a bundle goes over 30 minutes, we kill it, on the assumption that something is wrong with it.

We also do log aggregation in Splunk. As I mentioned before, we upload the stderr and stdout of each executor to S3. We then ingest all of that into Splunk, and we query Splunk to get statistics across multiple Seagull runs and across the cluster. In this example, I'm searching for a Yelp serializable-object validation error, and Splunk shows us when it occurred over the last four hours. This has been really helpful for debugging cluster-wide issues; we have much better visibility into what's going on.

As I mentioned before, we split our tests into multiple chunks, and we use two different algorithms to do so. The first is a greedy algorithm, which is basically a bin-packing algorithm. We keep track of test timings: whenever a test executes, we store how long it took and upload that to Elasticsearch. A nightly cron job then calculates the P90 over the last week for each individual test and uploads it to DynamoDB, and that data is used for bundling. When we create chunks, we sort the test list and split the tests into bundles of 10 minutes each. This algorithm has worked really well for us; we've also tried various linear-programming algorithms, but the gain they give us is not significant.

In our Selenium test suite, however, there are a bunch of constraints on test execution: some tests have to be executed together, while other tests cannot be executed together. To handle that, we have written a solver with PuLP. It's a really simple LP solver with three main goals: make sure each test is present in exactly one bundle, never let a bundle go above 10 minutes, and keep the number of bundles created as small as possible.
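As a sketch, the greedy bundling pass described above fits in a few lines. This is my own minimal reconstruction of the idea — P90 timings in, 10-minute bundles out — not our production bundle creator:

```python
def make_bundles(p90_seconds, budget=600.0):
    """Greedily pack tests into bundles of at most `budget` seconds.

    p90_seconds: dict mapping test name -> P90 runtime in seconds,
    as produced by the nightly cron job described above.
    """
    bundles, current, used = [], [], 0.0
    # longest tests first, so big tests don't straggle into tiny bundles
    for test, seconds in sorted(p90_seconds.items(), key=lambda kv: -kv[1]):
        if current and used + seconds > budget:
            bundles.append(current)
            current, used = [], 0.0
        current.append(test)
        used += seconds
    if current:
        bundles.append(current)
    return bundles

# e.g. make_bundles({"test_a": 420.0, "test_b": 300.0, "test_c": 240.0})
# -> [["test_a"], ["test_b", "test_c"]]
```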
As I mentioned before, we use an autoscaler for our cluster. You might be wondering why you should care about autoscaling. This is the weekly usage trend of our cluster: during weekends, the cluster is at minimum scale; on Monday it scales up; it stays busy through the middle of the week; and then resource utilization goes back down. All the dollar figures you see here are the money we save. The same goes for the daily usage trend: we have developers in Europe who push code in their morning — around 3 a.m. US time — and during US office hours the cluster utilization goes up. You can also see a dip during lunch hour, when utilization drops.

This is the overall architecture of our autoscaler, FleetMiser. There are two components. One gathers data from Mesos and other sources and stores it in Elasticsearch and DynamoDB. We use SignalFX and Sensu for monitoring the autoscaler itself — it emits useful data about when it took a particular decision, what the cluster looked like, and why it decided what it did. And it queries the Amazon API to scale the cluster up or down.

We use two different autoscaling signals: CPU utilization and Seagull runs in flight. First, CPU utilization. Our workloads are CPU bound — we always run out of CPUs before memory — so we track CPU utilization, and if it stays above 65% for the last 15 minutes, we scale up the cluster. When scaling down, we check whether cluster utilization has been below 35% for 30 minutes; if it has, we scale down. The second signal, runs in flight, was added last year: whenever a new Seagull run is triggered, the autoscaler gets notified that a run is in flight. This has helped us a lot. For example, say we are about to scale down because the cluster has been idle for the last 30 minutes, but all of a sudden developers in the EU come along and kick off a bunch of Seagull runs. The autoscaler knows how many runs are in flight and how many resources they will still need, and depending on that it either prevents the scale-down or adds more resources.

Scaling down is difficult, and we have implemented placement strategies to allow smooth scale-down. We also use AWS Spot Fleet, and one disadvantage of Spot Fleet is that we cannot specify which instances to terminate. So our autoscaler queries each individual Mesos agent and figures out how many tasks it is running; if a particular agent is not running any tasks, the autoscaler selects that instance for termination, terminates it, and then readjusts the Spot Fleet capacity. This basically tricks Spot Fleet into thinking it has terminated the instances we wanted gone.

As I mentioned before, we use spot instances. Back in May of 2015, we were using a combination of RIs — reserved instances — and on-demand instances. We slowly started ramping up on spot instances, and by July 2015 we saw a 55% reduction in cost. We moved to all spot instances around December of 2015, and the cost of the cluster has gone down 80%. That's a huge amount of savings.
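Putting the two scaling signals together, the decision rule boils down to something like the following sketch. The 65%/35% thresholds and the 15/30-minute windows are the ones I just described; the data shapes and function names are invented for illustration, not FleetMiser's actual code:

```python
import time

def autoscale_decision(cpu_samples, runs_in_flight, now=None):
    """cpu_samples: list of (unix_timestamp, utilization 0..1) pairs."""
    now = now or time.time()
    last_15 = [u for t, u in cpu_samples if now - t <= 15 * 60]
    last_30 = [u for t, u in cpu_samples if now - t <= 30 * 60]
    if last_15 and min(last_15) > 0.65:
        return "scale_up"        # hot for the whole 15-minute window
    if runs_in_flight > 0:
        return "hold"            # work is coming; never scale down now
    if last_30 and max(last_30) < 0.35:
        return "scale_down"      # idle for a full 30 minutes
    return "hold"
```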
Now I'll talk about some challenges we faced while building this system and the solutions we came up with. The first issue we ran into was bandwidth when talking to S3. We were using NAT boxes to talk to S3, and NAT boxes have limited bandwidth; we also use Docker images, which are backed by S3 buckets. So we had to come up with a way to avoid talking via the NAT boxes. The solution was VPC S3 endpoints. They provide fast, secure access to S3 without any bandwidth limitation — all the usual S3 limits are still enforced, but in terms of bandwidth they impose no restrictions. One nice property is that your traffic never leaves the Amazon network; it never goes over the public internet. One caveat is that your data and your application have to be in the same AWS region: if your bucket lives in a particular region, you must co-locate your application in that region to take advantage of VPC endpoints. After moving to VPC endpoints, we've seen lots of free bandwidth on our NAT boxes, and the other applications in the VPC are happy now.

Next, centralized Docker registries. As I mentioned before, we run 2.5 million Docker containers and download lots of different images from Docker registries. Our registries are backed by S3, so they simply send S3 redirects after getting a request. The initial setup at Yelp was multiple Docker registries on a single host behind an NGINX instance, with NGINX load-balancing across the registries. At the scale at which we were operating, that setup failed to cope with the number of requests we were making. We also tried putting the registries on different hosts behind ELBs, but we ran into issues with sticky cookies, et cetera, and couldn't make it work. The solution is to run multiple Docker registries on each agent. A Docker registry is basically a Docker container that sends redirects to S3, so it doesn't matter where it runs. On every agent we run two registries, and we use /etc/hosts to resolve the registry hostname to localhost. This has been a game changer: we have never since run into issues with bandwidth or with the registries failing to keep up with the number of requests. If you ever hit this problem, I highly recommend trying this solution.

The second big problem area is spot instances. Spot instances can be reclaimed at any point by Amazon, so you have to make sure your application is fault tolerant — it has to deal with executors going away and recover from that failure. What we do is run a cron job that constantly checks for the termination notice. After getting the notice, we terminate all the executors running on the agent; our scheduler gets a TASK_LOST status update for each executor and reschedules them elsewhere. We also terminate the Mesos agent process, to prevent it from being offered more tasks and then losing them. This has worked really well for us — we have not run into problems with spot instances going away.
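The termination check itself can be as simple as polling the EC2 instance metadata service. The metadata path below is AWS's documented spot termination-notice endpoint; the two reactions are sketched with assumed process and service names, since I'm not showing our actual cron job:

```python
import subprocess
import requests

# AWS populates this path about two minutes before reclaiming a spot
# instance; it returns 404 until a termination notice has been issued.
URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def handle_termination_notice():
    try:
        resp = requests.get(URL, timeout=1)
    except requests.RequestException:
        return
    if resp.status_code == 200:
        # kill executors so the scheduler sees TASK_LOST and reschedules
        subprocess.run(["pkill", "-f", "seagull_executor"])  # assumed name
        # stop the agent so Mesos stops offering this box to schedulers
        subprocess.run(["service", "mesos-slave", "stop"])
```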
As you may know, spot markets are pretty volatile: spot prices go up and down, they fluctuate, and that can have an adverse impact on your application. Getting the bid price right is really hard — how much to bid on a particular instance type is hard to determine — and the trade-off is between availability and cost savings: if you're willing to pay more, you get more availability, and vice versa. The solution is to make your application fault tolerant and diversify into more spot markets. For Seagull, we use 10 different instance types across three different AZs, so we have 30 different spot markets. We use a wide range of instances — C4s, I2s, M4s — and this has kept compute capacity available to our cluster all the time. We also use Spot Fleet, which makes sure we never pay more than the on-demand price for an instance: if the price of an instance goes above the on-demand price, those instances are terminated and we get capacity from other markets. This keeps the cost down.

Next, issues with the Docker daemon. We have run into a lot of issues with Docker, mainly because of the daemon. Sometimes Docker gets locked up and stops responding to our requests; we've also seen deadlocks in the daemon, and every time we upgrade Docker we run into some issue or other. The daemon sometimes fails to resolve DNS while other tools, like dig, work fine; we've tried different DNS workarounds, et cetera, but nothing has really fixed it. We use AUFS as our union filesystem, and we run kernel 4.2; we've often hit issues where AUFS causes a kernel panic and the CPUs go into a soft-lockup state where you cannot run any task on the agent. So we run a cron job that periodically SSHes to each box, checks dmesg, and sees whether the instance has gone bad; if it has, the job terminates the instance. Still, AUFS is probably one of the better union filesystems, based on our experience.

Orphaned Docker containers. After running 2.5 million Docker containers, we've ended up with a lot of orphaned containers. What happens is that an application tries to remove its containers, but because the Docker daemon is not responding, it gets an exception and just exits, and the containers are left behind. The resources these containers use are not accounted for in Mesos, so Mesos thinks there's more room to schedule tasks on the host, and eventually our boxes run out of memory. This used to cause a lot of issues for us, so we came up with a solution: we wrote a tool called Docker Reaper. It's a proxy for the Docker daemon, written in Go, and designed to be transparent — if you send it a signal, it forwards the signal to its children, and it cleans up Docker containers after the child process goes away. The way it works is that the Mesos executor starts a Docker Reaper; the Reaper creates a Unix socket and sets the DOCKER_HOST environment variable, which Docker clients use to find the daemon, and then fork-execs the child process. The child process goes on to create containers, and the create-container API call is intercepted by our proxy: it forwards the call to the Docker daemon, and when the response comes back it records the container ID in memory before forwarding the response to the child. When the child process goes away, the Reaper removes the containers it recorded. This has let us greatly reduce the number of orphaned containers. But because of the various Docker daemon issues, we are still left with a few, so we also run cron jobs on the agents that check which Docker containers have been running for more than 30 minutes and remove them.
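A minimal version of that last cleanup cron, written against docker-py (the Docker SDK for Python), might look like this. The 30-minute policy is the one from the talk; the rest — including the assumption that anything running longer than 30 minutes is orphaned — is my sketch:

```python
import datetime
import docker  # docker-py, the Docker SDK for Python

MAX_AGE = datetime.timedelta(minutes=30)

def reap_old_containers():
    client = docker.from_env()
    now = datetime.datetime.now(datetime.timezone.utc)
    for container in client.containers.list():
        # StartedAt looks like '2017-09-01T12:00:00.123456789Z';
        # keep seconds precision so fromisoformat can parse it
        started_raw = container.attrs["State"]["StartedAt"][:19]
        started = datetime.datetime.fromisoformat(started_raw).replace(
            tzinfo=datetime.timezone.utc)
        if now - started > MAX_AGE:
            container.remove(force=True)  # assumed orphaned: kill and delete

if __name__ == "__main__":
    reap_old_containers()
```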
Mesos maintenance mode. This is a great feature that Mesos provides, but unfortunately it is designed to be used by a single operator: any time you make a POST request to the maintenance API, it overrides the existing maintenance schedule. So if you want to use this feature, you need to make sure only one operator at a time is talking to Mesos and making POST requests to that endpoint — you basically need external locking, for example a ZooKeeper-based lock, to ensure that. We are talking to the Mesos folks and trying to figure out whether the maintenance mode can be improved to make it usable by multiple operators.

So, the future of Seagull — where would we like to go over the next year or so? We'd love to use oversubscription. Oversubscription would allow us to use the residual capacity in the cluster to run low-priority batch workloads — photo classification, for example: since there's no stringent upper bound on when classification must finish, we can just run it on the cluster's residual capacity. We're also replacing the core component of the Seagull scheduler. The scheduler was written a long time ago and needs an upgrade, so we have written a library called task processing, and we are replacing the scheduler with it. We'd also love to use the CSI plugin once it's available and replace our cluster-wide-resources mechanism with that more robust solution, delegating the responsibility of allocating cluster-wide resources to Mesos. And we want to make Seagull easier to use, so that more and more people at Yelp can take advantage of the parallelization it provides.

Executor improvements: we'd love to containerize everything. We love Docker containers — most of our production workload already runs in Docker — and we'd like everything to. We'd also like to use the Mesos containerizer with the Docker runtime and eliminate the need to talk to the Docker daemon entirely; that would save us a lot of effort and money. Mesos also recently added nested containers and pods. This is a great feature, especially for Seagull: we can use the nested containerizer, spin up Docker containers within that container, and once our application finishes, the outer container goes away and reaps all the containers started inside it — so we don't have to deal with orphaned Docker containers at all.

Autoscaler improvements: while our autoscaler works reasonably well right now, there are better algorithms we could use to drive up cluster utilization. Right now we use a single Spot Fleet request, which limits us to 30 spot markets. We'd like to use multiple Spot Fleets, expand into more markets, and see if we can get more savings. We'd also love to use more instance types in the Seagull cluster, which may drive the cost down further.

And yes, we are hiring in Europe, in our London and Hamburg offices. We're looking for distributed systems engineers and managers, so if you are interested in working at Yelp, please reach out to me and I can put you in touch with the right people. That's all I have for this talk. Does anybody have any questions? Thank you.