Hi, my name is Juan Santos. I am a software engineer at Pivotal, working on the Diego team. This is my friend Jim Myers. He's also a software engineer at Pivotal, and he's currently the anchor on the Diego team. We're going to talk to you about our work on the scaling and performance of Diego, the Cloud Foundry distributed container management system. More specifically: what Diego is, a quick overview of the system, how we defined performance and scaling goals as a product decision, how we designed our performance experiments and how we evaluate success, and then a little bit about the future of the scaling work we're doing on Diego right now.

So what is Diego? At a high level, Diego is a distributed container management system, which means it schedules and runs containerized workloads. Diego was built from the ground up for Cloud Foundry to replace the old Ruby runtime, the DEA, and it should give the platform more portability, stability, and scalability. Even though it's a general-purpose container scheduler, all of its use cases are driven by the Cloud Foundry platform, such as building and running buildpack-based applications, building and running Docker-based applications, and also running Windows .NET applications.

As we go through this talk, we're going to touch on a few of the Diego components you can see in this picture. One of them is the BBS, the central Diego API. Everything that happens in a Diego cluster has to go through that API, and that's where the truth is saved, basically. That storage is backed by etcd, a distributed key-value store that gives us reliable access to that data.

Another component is the cell. The cell is ultimately where all the containers run; it is the capacity of a Diego cluster. To put it simply, the more cells we can manage, the more containers we're going to be able to run. The cell itself is composed of a few components, and one of the ones we're going to talk about today is Garden. Garden is the successor to Warden, the Cloud Foundry container technology. It is a platform-agnostic API that exposes container and process management without the rest of the system having to worry about platform-specific details. The supported implementations Garden has today are Garden-Linux and Garden-Windows, with Garden-runC coming soon. The one we're using in production right now, and the one we used for these performance tests, is Garden-Linux, which, as the name suggests, is the Linux implementation of Garden.

Another term we're going to use a little is batch processes, or bulk processes. These are routines that run periodically on Diego and are responsible for keeping the system consistent. For example, if in a catastrophic situation we lose all or part of our database, one of these batch processes will run on its period, grab all the desired state from the Cloud Controller API, and then repopulate Diego through the Diego API, restoring the health of the cluster.

And then the auctioneer is the central logic server for Diego. It's the component responsible for making scheduling decisions: it decides which cell is going to run which workload, or which batch of workloads. So it needs an efficient way of getting an accurate picture of what the cluster looks like at any given point in time.
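To give a feel for what a platform-agnostic container API looks like, here is a small, hypothetical Go sketch in the spirit of Garden; the interface and method names are illustrative assumptions, not Garden's actual API.

```go
// Hypothetical sketch of a platform-agnostic container API in the spirit of
// Garden; these names are illustrative, not Garden's real interface.
package main

import "fmt"

// ContainerSpec describes the desired container, independent of the backend.
type ContainerSpec struct {
	Handle   string
	RootFS   string
	MemoryMB int
	DiskMB   int
}

// Backend is what a platform-specific implementation (Linux, Windows, runC)
// would provide behind the common API.
type Backend interface {
	Create(spec ContainerSpec) (Container, error)
	Destroy(handle string) error
	Containers() ([]Container, error)
}

// Container exposes process management without leaking platform details.
type Container interface {
	Handle() string
	Run(path string, args ...string) error
}

func main() {
	fmt.Println("backend implementations: garden-linux, garden-windows, garden-runc")
}
```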
And lastly, tasks and long-running processes. They are the units of work on Diego: when you ask Diego to do something, you're asking it to run a task or a long-running process. A task is a one-off action that runs at most once and has no uptime guarantees, because it's expected to end. A long-running process is something that Diego will monitor and guarantee stays up.

So how do we define a scaling goal for Diego? It's largely a product decision. We need to know how many cells we want to support and how many application instances we want to run in those cells. The first part of the work is defining where we stand: what numbers could we support when we started this? Then we decide where we want to be short-term and long-term. For Diego, we initially targeted a larger-than-average deployment size, which for us meant 100 cells running around 10,000 application instances. That is a modest goal, those are not huge numbers, but it's a stepping stone toward the more aggressive numbers we want to hit in the future, and it allowed us to flesh out a few bottlenecks in the system early on.

Scaling and performance is also an engineering concern, because as engineers we want to be proud of the software we are developing, and we want to have a good story to tell when we're asked how we compare against the other container schedulers out there in the cloud ecosystem. While that comparison is not the primary objective of this experiment, it's important for us to know where we stand among the members of that ecosystem. Performance is now part of the core development process of the Diego team, and we as engineers have to keep an eye open for opportunities to make the system scale better.

Cool. So now that we've defined a set of performance goals that address both the product and the engineering team's concerns, we need to design observable and reproducible experiments that we can use to judge the performance characteristics of a Diego deployment. The first thing we need is to be able to measure performance. This means defining what it means to be successful when running a Diego deployment at scale. From product, we know this means we want to run n instances against some number of cells. But there's more to it than that: we also want all of our containers to be routable at all times, the performance of our BBS API to be acceptable, and the bulk processing loops to operate within reasonable time limits. Through both metrics and log lines emitted to Loggregator, we're able to monitor the state of the system and get a complete picture of performance at any given time.

As well as being able to measure and monitor the health of the system, we also need to design experiments that produce meaningful data. The questions we have to answer are: how do we generate a realistic load to run on Diego, and how does Diego perform under unusual situations such as bursty loads or failures? When designing the performance test suites, we needed to make sure we generated the right artificial load so that we could have confidence in the system and in the metrics we obtained. One of the ways we measure performance is through CF's own internal logging and metric aggregation system, Loggregator.
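To make the task/LRP distinction concrete, here is a hedged Go sketch of the two units of work; the types and fields are illustrative assumptions, not Diego's actual API models.

```go
// Illustrative sketch of Diego's two units of work; field names are
// hypothetical, not the real Diego models.
package main

import "fmt"

// Task is a one-off action: it runs at most once and is expected to finish.
type Task struct {
	Guid   string
	Action string // e.g. a staging or migration command
}

// LRP (long-running process) is work Diego monitors and keeps running.
type LRP struct {
	ProcessGuid string
	Instances   int    // Diego restarts instances to keep this many up
	StartCmd    string
}

func main() {
	fmt.Println(Task{Guid: "stage-app-1", Action: "bin/stage"})
	fmt.Println(LRP{ProcessGuid: "web-app-1", Instances: 4, StartCmd: "bin/web"})
}
```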
We're able to emit various metrics and then use tools such as Datadog to generate charts and graphs that let us visualize what's happening in the system. In these graphs we can track things like the total number of application instances in the deployment, the total number of routes, the throughput and latency of the BBS, and many other concerns. Along with the metrics we emit, there's also the internal logging of the components themselves, which lets us take a more detailed look at what is actually happening on Diego at any given time. By post-processing these logs, we're able to create graphs like the one seen here: we tag log lines with task and LRP guids, and then we can map where an individual piece of workload is in the system at any given time. This chart shows 4,000 tasks running on Diego; each color represents a different stage of a task's lifetime, and the x-axis is the time it took. This is a great way for us to visualize performance bottlenecks and find them quickly so that we can solve them quickly.

Our first experiment was designed as a smoke test to simply validate basic aspects of Diego's performance. It consists of running larger and larger numbers of workloads in a short period of time. These workloads are just one-off tasks like the ones Jim showed, and also some long-running processes. More specifically, we spun up 4,000 instances of a lightweight web server and measured how long it took for that web server to become healthy on all 4,000 instances. That is comparable to what some of the other container schedulers publish as their performance benchmarks. While this isn't necessarily a realistic workload for a production-grade Diego deployment, it did allow us to find some obvious bottlenecks really quickly. One other reason it does not represent a realistic workload is that it only fills roughly 40% of our capacity.

Right from the start of these smoke tests, our tools allowed us to quickly observe a few obvious problems with the system. For example, we were spending a lot of time unmarshaling and marshaling JSON, partially in Diego's own code base and partially in etcd's code base when it reads data from the data store. The other thing we found is that we had hard-coded concurrency variables, such as the number of threads used for a certain batch process. Those were the first two obvious things we found with the smoke tests.

To fix that, we first replaced JSON with protocol buffers, a mechanism for efficiently transmitting structured data that is both smaller and faster than JSON, and that also gave us good community tooling for code generation and Go support. We also replaced REST with RPC, which allowed us to express our API in a more concise way and reduce the number of network calls we make for simple operations. And finally, to fix the concurrency problem, we made Diego more configurable, so as you grow your Diego cluster and give those VMs more resources, you can also grow those concurrency variables. It seems like an obvious thing, but it's something we missed initially. Once we fixed those obvious problems, we got some meaningful results: we were able to run those 4,000 instances in less than 30 seconds.
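As an illustration of the concurrency fix, here is a minimal Go sketch of a bulk-processing worker pool whose parallelism is operator-configurable instead of hard-coded; the flag name and batch size are hypothetical.

```go
// Illustrative sketch of making bulk-processing concurrency configurable
// rather than hard-coded; the flag name is hypothetical.
package main

import (
	"flag"
	"fmt"
	"sync"
)

func main() {
	// Previously a hard-coded constant; now an operator-tunable value that
	// can grow as the Diego VMs are given more resources.
	workers := flag.Int("bulk-workers", 50, "concurrent workers for bulk processing")
	flag.Parse()

	// A batch of work items to process (stand-in for records from the BBS).
	work := make(chan int, 1000)
	for i := 0; i < 1000; i++ {
		work <- i
	}
	close(work)

	var wg sync.WaitGroup
	for w := 0; w < *workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range work {
				_ = item // process one unit of the bulk loop here
			}
		}()
	}
	wg.Wait()
	fmt.Printf("processed bulk batch with %d workers\n", *workers)
}
```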
And while 30 seconds is not a fantastic time, it is within the boundaries of what we consider a reasonable amount of time to start all of this work, and we decided to focus on growing the number of containers we can support horizontally rather than reducing this time. One other thing we noticed after fixing these initial problems was that the time spent in Diego's components was significantly lower; if you compare it with how long it actually takes to spin up a container image, it's almost irrelevant. In those charts (those aren't real numbers, but they're based on the numbers from the experiment), the green bar is the time it takes to spin up a container image, and the other colors are time spent in actual Diego code. That was really encouraging, and after this we felt confident enough to start the real full-scale performance tests, because we didn't find any other obvious bottlenecks in the smoke test.

Cool. So now that we've shown that Diego can run bursty workloads with no obvious performance issues, we began to design an experiment that more accurately reflects the production-style workloads we want to run in this environment. The test that we designed operates at the CF level: cf push a set of various applications that load up the environment until it's completely saturated, at full capacity. In this case, for 100 cells, we pushed 10,000 application instances. Each of the different applications stresses the environment in a different way. We had some applications that were very CPU-intensive and generated a lot of logs, and we also had applications that would crash after a reasonable amount of time. By using this diverse set of applications, we're looking to simulate a reasonable user-provided workload on the environment.

After we successfully saturated the environment, we just let it sit for about a week. The idea is that we can monitor it throughout that period, and we shouldn't see any performance degradation over that time. After that, since we have a saturated environment, we can push additional workloads against it to get insight into how Diego reacts to common workloads at this scale. This test suite closely reflects the initial goal of being able to support a given number of application instances on a given number of cells. It's important to note that while we strove to compete with our competitors' numbers in the previous smoke test, this test suite is aimed at accurately reflecting the performance of Diego to end users and operators. We want to show that Diego can run a large workload for an extended period of time with no performance degradation, rather than proving that Diego can run a generally unrealistic workload very fast.

So in our real experiment, on 100 cells we pushed 10,000 application instances, and we were happy to see no performance degradation over time. The BBS API request latency increased from almost nothing on an unloaded environment to less than a second, which we felt was a reasonable number. We saw a very similar pattern for all the bulk processing loops: they went from almost nothing to times that were still much shorter than their operating periods.
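To make the shape of that saturation test concrete, here is a minimal Go sketch of driving a diverse application mix with the cf CLI; the app names, instance counts, and mix are illustrative assumptions, not the actual diego-perf-release tooling.

```go
// Hypothetical sketch of driving a saturation test from the outside with the
// cf CLI; app names and the mix are illustrative only.
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// A diverse mix: a CPU/log-heavy app and an app that crashes periodically.
	apps := []struct {
		name      string
		instances int
	}{
		{"log-spinner", 64},
		{"crasher", 16},
	}
	for _, app := range apps {
		// cf push <name> -i <instances> scales each app to the desired count.
		cmd := exec.Command("cf", "push", app.name, "-i", fmt.Sprint(app.instances))
		if out, err := cmd.CombinedOutput(); err != nil {
			fmt.Printf("push %s failed: %v\n%s\n", app.name, err, out)
		}
	}
}
```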
We were also able to push new workloads onto this environment quite successfully, and ultimately this experiment gave us a huge amount of confidence in running Diego in a production-scale system. But since we had only exercised Diego on the happy path, we also wanted to test failure scenarios. Now that we had a saturated environment, we tried out several failure scenarios to see how Diego would react.

The first failure mode we investigated was a partial failure of the database, which we interpreted as a loss of connectivity to the database. We simulated this by stopping the consul agents on all of the database VMs, which essentially made them unreachable from all of the internal services in Diego. After doing so, we saw that all of the applications that were running continued to run and remained routable. This was entirely intentional; one of Diego's design principles is that running workloads are the top priority. We weren't able to desire new work, as expected, because you can't talk to the database, but once connectivity was restored by restarting the consul agents, the system returned to a stable state.

The next failure we wanted to investigate was a total failure of the database node, essentially losing all of the data. We simulated this by removing the persistent store on the database node, so we just blew away all of the data. Once again, we saw similar results to the partial outage: the cells and running applications continued to operate successfully and remained routable. Once BBS functionality was restored, the cells immediately repopulated all of the running data back into the database, and the batch processes eventually took all of the definitions from the Cloud Controller and put the desired data back in as well. In about 10 to 15 minutes, we had restored all of the data we had just deleted.

The last failure scenario we wanted to test was the failure of a cell itself while it was running applications. To simulate this, we just destroyed the VM in the infrastructure. We observed that within 30 seconds, one of our batch processes noticed that all of this workload was missing and rebalanced it onto the remaining cells. We did lose routes for the short period while this happened, because the applications were no longer running, but as soon as they started up on the other remaining cells, routability was restored and everything was fine. Once we killed enough cells that there was too much desired work for the available capacity, we noticed that running applications were favored over crashing applications. This is largely due to the backoff we apply when scheduling applications that have crashed, and it's pretty important because it shows that a Diego deployment will favor workloads that work over workloads that are misconfigured. Once we restored the cells, all of the workload was once again able to run on the system.

With these performance experiments, we're pretty confident right now that Diego can run a reasonable number of instances on 100 cells, but as Cloud Foundry keeps growing, we know that we're going to have to run environments that are an order of magnitude larger than what we tested.
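As a rough illustration of the crash-backoff behavior mentioned above, here is a small Go sketch of an exponential restart delay; the base delay and cap are assumptions, not Diego's actual constants.

```go
// Illustrative sketch of an exponential crash-restart backoff of the kind
// described in the talk; the constants here are hypothetical.
package main

import (
	"fmt"
	"time"
)

// crashBackoff returns how long to wait before restarting an instance that
// has crashed crashCount times, capped at a maximum delay.
func crashBackoff(crashCount int, base, max time.Duration) time.Duration {
	delay := base
	for i := 1; i < crashCount; i++ {
		delay *= 2
		if delay >= max {
			return max
		}
	}
	return delay
}

func main() {
	for n := 1; n <= 6; n++ {
		fmt.Printf("crash #%d -> wait %s before restart\n",
			n, crashBackoff(n, 30*time.Second, 16*time.Minute))
	}
}
```

The effect is that a repeatedly crashing application is rescheduled less and less often, so healthy workloads keep the capacity when the cluster is short on cells.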
So, for example, our new performance target is 1,000 cells and 200,000 application instances spread across the cluster. We would love to run this performance experiment on a regular basis, but that's pretty hard to do: it's very expensive to run 1,000 nodes, and it's also really hard to manage over a period of time with just the developers on the team. So we created an experiment that we like to call the benchmark BBS. The main idea behind it is that we can simulate the total load on the API server of a 1,000-cell deployment without actually running 1,000 nodes. We're pretty confident that a single saturated cell is easy to performance test; this is more about testing the distributed nature of Diego over time, and how the central database responds as we add more nodes. We built this test suite by analyzing the activity that our various components generate against the API and then replicating it from a smaller number of nodes. The experiment is mainly set up to analyze the read and write characteristics of the BBS. Specifically, we want to know how long our bulk endpoints take, and how often we fail to write to or read from the database.

The results from this test were kind of unexpected. We found that etcd had become our biggest bottleneck in the system. This is largely because etcd v2 doesn't handle large data sets well (in this example we were at almost a gigabyte), and it also doesn't natively support traditional niceties such as indexes. As we added more and more data, we saw etcd continue to degrade. Cells were taking longer than 30 seconds to fetch their required records, essentially because every fetch has to scan the whole data set. We also noticed that we were failing to seed the data for our test, and the bulk loops were beginning to take longer than their expected operating intervals. We know that etcd v3 will improve on a lot of these issues by adding indices; however, v3 is still in beta, and we need to hit these performance targets now. So we knew we were going to have to make some changes to our backing database technology.

How are we going to get to this scale with these blockers? One thing we're doing right now is experimental support for a relational back-end on Diego. You can configure a pluggable relational database that is deployed alongside Diego, and Diego will just use it. The reasons for this switch go beyond scaling and performance. Our data model is inherently relational: applications have many instances, and cells have many containers, and "has many" is the kind of thing you say when you have a relational model. We have also wanted rich support for indices in our data layer for a while; having to do a full scan, like Jim mentioned, is not the most efficient way to get that data. One way we could get around that is by completely normalizing our schema, but that's not really sustainable engineering-wise. One of the concerns we had when we started this effort to switch to a relational database was high availability, because etcd is highly available by definition: it's a cluster, and you get that high availability for free, while traditional relational databases are generally single-node by default. But we're confident that efforts like Galera for MySQL will bridge that gap.
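To make the "inherently relational" point concrete, here is a hedged sketch of what an indexed relational schema for this kind of data model might look like, embedded in a Go program; the table and column names are hypothetical, not Diego's actual schema.

```go
// Hedged sketch of an indexed relational schema for a data model where
// applications have many instances; names are hypothetical, not Diego's.
package main

import "fmt"

const schema = `
CREATE TABLE desired_lrps (
    process_guid VARCHAR(255) PRIMARY KEY,
    instances    INT NOT NULL
);

CREATE TABLE actual_lrps (
    process_guid VARCHAR(255) NOT NULL,
    idx          INT NOT NULL,
    cell_id      VARCHAR(255),
    PRIMARY KEY (process_guid, idx)
);

-- The kind of index the talk alludes to: a cell can fetch only its own
-- records instead of scanning the whole data set, as etcd v2 forced.
CREATE INDEX actual_lrps_by_cell ON actual_lrps (cell_id);
`

func main() {
	fmt.Println(schema)
}
```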
Even in the face of failure, we saw that Diego is pretty good at recovering and staying robust, with minimal to no impact on the user. Finally, we made the switch after exploring this option using the benchmark BBS suite that Jim mentioned, with a minimal implementation of the relational backend, and we got some really encouraging results. We were able to store upwards of 600,000 instances in the database even before we started paginating or fragmenting data, which is a lot more than we could possibly store with the old model. We decided to support both MySQL and Postgres, the two SQL implementations currently supported by the Cloud Controller and UAA, because that helps support the broader Cloud Foundry community. They're also widely used and proven to be effective when used correctly at large scale. Before you ask when it's going to be out of experimental mode: you can follow the progress in Tracker. We're working out some last implementation and migration details, we're going to start a full-scale test of this whole effort soon, and hopefully it will be out after that.

Another option we explored to get to the scale we wanted was adding in-memory caching to the API server. We had a pair of engineers explore that for a few days, and our conclusion was that while it would definitely get us closer to, and maybe even hit, the numbers we wanted, it wouldn't fix the underlying problem: we were forcing the data model to match our data store rather than letting it flow naturally. That's why we decided not to go with it, but it's definitely something we still have in our pocket in case we need to optimize something else in the future. The other thing we can explore in the future is adding read replicas of the API server, because currently the API server is a leader-elected single node for both reads and writes. It makes sense for writes to go through one server because we want them to be consistent, but we could definitely scale the read side horizontally. That's another thing we have in our pocket for the future.

That's all we had. Here are some links and resources from the talk. Questions?

We have a BOSH release, the diego-perf-release, out right now, and it has all of the tests that we ran. It's not easy to replicate, but if you talk to us, we can help you spin it up and figure it out. As we start running the full-scale tests for the relational backend, we intend to recreate that in a more reproducible way so people can consume it more easily. We also have plans to publish these metrics publicly. Yes, these tests were on Amazon; we currently have plans to maybe investigate doing a larger-scale run on SoftLayer as well.

The question is about application caching when you push. Diego is not part of that system, so all of that happens outside of the Diego boundaries. I don't know of any plans to integrate that; I don't think we're going to do that right now, but as far as I know, that's completely outside of Diego.

No, we just wrote all the queries by hand, because we felt an ORM was too heavyweight for the queries we were making. We are leaving that option open for the future.

The question is about whether we explored Bigtable-style implementations. No, we decided to go with a traditional solution because we were already coming from a non-standard solution for the problem, and our data model is actually fairly simple.
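On the "queries by hand" answer above, here is a minimal Go sketch of that style, using database/sql directly rather than an ORM; the table, query, and connection string are hypothetical.

```go
// Minimal sketch of the hand-written-queries style mentioned in the Q&A,
// using database/sql directly instead of an ORM; names are hypothetical.
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver; a Postgres driver works too
)

// actualLRPsForCell fetches the instance guids running on one cell, relying
// on an index over cell_id rather than a full scan of the data set.
func actualLRPsForCell(db *sql.DB, cellID string) ([]string, error) {
	rows, err := db.Query("SELECT process_guid FROM actual_lrps WHERE cell_id = ?", cellID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var guids []string
	for rows.Next() {
		var guid string
		if err := rows.Scan(&guid); err != nil {
			return nil, err
		}
		guids = append(guids, guid)
	}
	return guids, rows.Err()
}

func main() {
	db, err := sql.Open("mysql", "diego:password@/diego")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	guids, err := actualLRPsForCell(db, "cell-z1-0")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("cell runs %d instances", len(guids))
}
```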
Now that we have this single-writer BBS server, we can actually generate these events manually in a higher layer. It's a design goal we're moving toward: moving away from those etcd watches, because we tied ourselves very closely to etcd with them, and now that we've started to break away from that, we have more freedom in the backing layer. That's not a problem we've solved quite yet; currently we have a master-elected database write server, and I think we're going to keep it that way for the time being. We may investigate having read replicas that read from the database, but I don't think we're going to solve the multi-write-node problem at this point in time.

Really cool stuff. Question: did you ever play with an in-memory database like GemFire as the backing store for the BBS, since durability really isn't a concern? So we haven't actually explored that, but we considered it; that's the short answer.

Cool. We'll be around if you want to ask us questions. Thank you for listening.